Do the check _before_ indexing. Use https://code.google.com/p/language-detection/ to verify the language of the text document before you put it in the index.
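
For the coarser case raised in the question (Latin vs Cyrillic vs Arabic vs Chinese), a script-level check can be done with the JDK alone. The sketch below is not the language-detection library linked above. It is a stand-in that uses `Character.UnicodeScript` (Java 7+) to find the script most letters belong to. The class and method names (`ScriptCheck`, `dominantScript`) are made up for illustration. It cannot tell English from Spanish, since both are Latin script. For that you still need a real language detector.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: a script-level sanity check to run before indexing.
final class ScriptCheck {

    /**
     * Returns the Unicode script that the largest number of letters in
     * the text belong to, or UNKNOWN if the text contains no letters.
     */
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);

        // Walk by code point so supplementary characters are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum);
        }

        Character.UnicodeScript best = Character.UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantScript("The quick brown fox"));
        System.out.println(dominantScript("Быстрая коричневая лиса"));
    }
}
```

If the dominant script of a document does not match what the chosen analyzer expects (say, CYRILLIC text headed for an English analyzer), flag it before it goes into the index.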

-Glen Newton
http://zzzoot.blogspot.com/

On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin <izavo...@caci.com> wrote:
> Suppose I have a bunch of text documents in language X but I index them
> using an analyzer for language Y. Once the index is created, is it possible
> to perform some sort of simple "sanity" check to see if the original language
> selection was wrong? I presume I can try searching for some common word in
> language Y, but I am not sure how reliable this would be. On the other hand,
> if languages are from the same group, say X and Y are English and Spanish, I
> should expect that this sanity check would produce a false match. However, I
> would be happy if it worked reliably enough for languages using different
> scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc.
>
> Thanks much
>
> Ilya Zavorin