On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<ivan.prova...@cengage.com> wrote:
> We are planning to ingest some non-English content into our application.  All 
> content is OCR'ed and there are a lot of misspellings and garbage terms 
> because of this.  Each document has one primary language with some 
> exceptions (e.g. a few English terms mixed in with primarily non-English 
> document terms).
>

Sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in 
> the same index directory in Lucene (e.g. Spanish/French/English)?

I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
different indexes for each collection.
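
Just to sketch what I mean (the language codes and directory layout here
are only placeholders, not something your app has to follow):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// One index directory per language; the searcher is picked by the
// language the user chose in the UI.
public class PerLanguageSearchers {
  private final Map<String, IndexSearcher> searchers =
      new HashMap<String, IndexSearcher>();

  public PerLanguageSearchers(File indexRoot, String... langs) throws Exception {
    for (String lang : langs) {
      IndexReader reader =
          IndexReader.open(FSDirectory.open(new File(indexRoot, lang)));
      searchers.put(lang, new IndexSearcher(reader));
    }
  }

  public IndexSearcher forLanguage(String lang) {
    return searchers.get(lang); // e.g. "es", "fr", "en"
  }
}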

> 2. What about mixing Latin and non-Latin languages?  We ran tests on English 
> and Chinese collections mixed together and didn't see any negative impact 
> (precision/recall).  Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would
be a skewed maxDoc (the combined document count feeds into IDF, so term
weights shift a little), but that's probably not a big deal at all. But
what's the benefit of mixing them?
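
If you want to see the maxDoc effect concretely, here's a toy
calculation using the classic DefaultSimilarity idf,
log(numDocs/(docFreq+1)) + 1; the document counts are just made-up
numbers for illustration:

public class IdfSkew {
  public static void main(String[] args) {
    // A term in 10,000 of 1,000,000 English docs, English-only index:
    double idfEnglishOnly = Math.log(1000000.0 / (10000 + 1)) + 1.0; // ~5.6
    // Add 1,000,000 Chinese docs that never contain the term: numDocs
    // doubles, docFreq stays the same, so the term looks a bit rarer.
    double idfMixed = Math.log(2000000.0 / (10000 + 1)) + 1.0;       // ~6.3
    System.out.println(idfEnglishOnly + " vs " + idfMixed);
  }
}

So scores drift slightly, but it doesn't change which documents match.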

> 3. Any recommendations for an Urdu analyzer?
>

You can always start with StandardAnalyzer, as it will tokenize the text...
You might also be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
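
A rough sketch of wiring such a list in as a stoplist (the file path is
just an assumption, and the exact StandardAnalyzer constructor / Version
constant depends on which Lucene release you're on):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class UrduAnalyzerSketch {
  public static Analyzer create(File urduStopwords) throws Exception {
    // one stopword per line, saved as UTF-8 (e.g. the CRULP closed-class list)
    Set<String> stopSet = WordlistLoader.getWordSet(
        new InputStreamReader(new FileInputStream(urduStopwords), "UTF-8"));
    // StandardAnalyzer will tokenize the Arabic-script text; the custom
    // stoplist replaces the default English one.
    return new StandardAnalyzer(Version.LUCENE_31, stopSet);
  }
}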
