I have crawled a page with both English and Russian (I think) content into my index but can't seem to get search results when using a Russian search term.
The page is: http://englishrussia.com/?p=845
The search term is: воды
The term appears in one of the comments ('Comment by Henry'). I dumped the segment in which the page content is stored, and the correct UTF-8 characters are there, so the fetch seems to have been fine.
This is, of course, only an example; I've had similar results with other terms and other similar pages. I don't know Russian, but I have tried enough different words that I don't think I am using the equivalent of "the" as a search term.
I had been having issues with character encodings in the servlet, but I seem to have worked those out. As far as I can tell from some extra logging I added to the search servlet, the Query object built by the parser is correct.
Can Nutch support this kind of searching in mixed-language content, or is the problem with Lucene? How can I make this work?
Thanks.
