On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan <ivan.prova...@cengage.com> wrote:
> We are planning to ingest some non-English content into our application. All
> content is OCR'ed, and there are a lot of misspellings and garbage terms
> because of this. Each document has one primary language with some
> exceptions (e.g. a few English terms mixed in with primarily non-English
> document terms).
>
Sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in
> the same index directory in Lucene (e.g. Spanish/French/English)?

I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
different indexes for each collection (a rough sketch of this is below).

> 2. What about mixing Latin and non-Latin languages? We ran tests on English
> and Chinese collections mixed together and didn't see any negative impact
> (precision/recall). Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would be a
skewed maxDoc, but this is probably not a big deal at all. But what's the
benefit to mixing them?

> 3. Any recommendations for an Urdu analyzer?

You can always start with StandardAnalyzer, as it will tokenize the text...
you might be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm and
http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm as a
stoplist (a second sketch below shows one way to wire that in).
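For the "one index per collection/language" route, here is a minimal sketch
against the Lucene 3.1-era API. The directory layout ("indexes/<code>"), the
language codes, and the use of StandardAnalyzer as a placeholder are all
assumptions for illustration, not a recommended implementation:

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerLanguageIndexes {
  private final Map<String, IndexWriter> writers = new HashMap<String, IndexWriter>();

  // Lazily opens (and caches) one IndexWriter per language code supplied by the UI.
  public synchronized IndexWriter writerFor(String languageCode) throws IOException {
    IndexWriter writer = writers.get(languageCode);
    if (writer == null) {
      Directory dir = FSDirectory.open(new File("indexes/" + languageCode));
      // A language-specific analyzer could be plugged in here instead;
      // StandardAnalyzer is just a placeholder.
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
      writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_31, analyzer));
      writers.put(languageCode, writer);
    }
    return writer;
  }

  public synchronized void close() throws IOException {
    for (IndexWriter w : writers.values()) {
      w.close();
    }
    writers.clear();
  }
}

Searching would then simply open the index for whichever language the user
picked, so there is no need for per-document language detection at query time.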
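And a rough sketch of the Urdu suggestion: StandardAnalyzer plus a stopword
set built from one of the CRULP lists. The file name "urdu-stopwords.txt" is a
hypothetical local copy of that list, saved as UTF-8 with one word per line;
everything else is the stock Lucene 3.1-era API:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class UrduAnalyzerSketch {
  // Builds a StandardAnalyzer whose stopword set comes from a plain-text file.
  public static Analyzer create(String stopwordFile) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(stopwordFile), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) {
          stopWords.add(line); // one Urdu stopword per line
        }
      }
    } finally {
      in.close();
    }
    // StandardTokenizer (UAX#29 word breaking in 3.1+) copes with Arabic-script
    // text, so this only layers Urdu stopword removal on top of it.
    return new StandardAnalyzer(Version.LUCENE_31, stopWords);
  }
}

From there you could replace StandardAnalyzer with something more tailored
(normalization, stemming) if the OCR quality makes it worthwhile.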