Do the check _before_ indexing.
Use language-detection (https://code.google.com/p/language-detection/)
to verify the language of each text document before you add it to the
index.
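
For the different-scripts case raised in the question (Latin vs Cyrillic vs
Arabic vs Chinese), a much coarser pre-indexing check is possible with the JDK
alone. The sketch below does not use the language-detection library and does
not identify languages; it only reports the dominant Unicode script of a
document. It therefore cannot tell English from Spanish. The class and method
names (ScriptCheck, dominantScript, matchesExpectedScript) are hypothetical,
chosen for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: a coarse script check, not a language detector.
class ScriptCheck {

    // Returns the most frequent Unicode script among the letters of the text,
    // or UNKNOWN if the text contains no letters.
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)   // ignore digits, spaces, punctuation
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));

        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // True if the document's dominant script is the one the chosen analyzer
    // expects (e.g. CYRILLIC for a Russian analyzer).
    static boolean matchesExpectedScript(String text, UnicodeScript expected) {
        return dominantScript(text) == expected;
    }

    public static void main(String[] args) {
        String doc = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"; // Russian sample text
        System.out.println(dominantScript(doc));
        System.out.println(matchesExpectedScript(doc, UnicodeScript.LATIN));
    }
}
```

Calling matchesExpectedScript on each document before it is added to the index
would flag, say, Cyrillic text about to be fed to an English analyzer. For
languages that share a script, a real language detector is still needed.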

-Glen Newton
http://zzzoot.blogspot.com/

On Mon, Feb 27, 2012 at 10:53 AM, Ilya Zavorin <izavo...@caci.com> wrote:
> Suppose I have a bunch of text documents in language X but I index them
> using an analyzer for language Y. Once the index is created, is it possible 
> to perform some sort of simple "sanity" check to see if the original language 
> selection was wrong? I presume I can try searching for some common word in 
> language Y, but I am not sure how reliable this would be. On the other hand, 
> if languages are from the same group, say X and Y are English and Spanish, I 
> should expect that this sanity check would produce a false match. However, I 
> would be happy if it worked reliably enough for languages using different 
> scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc.
>
>
> Thanks much
>
>
>
> Ilya Zavorin



