Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module.
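Something along these lines might be a starting point. It's only a rough, untested sketch against the Lucene 4.x API (the createComponents signature changed in later releases), and ICUFoldingFilter is just one example of the filters in that module:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    // Sketch: an Analyzer built on ICUTokenizer from the analyzers-icu module.
    // ICUTokenizer does script-aware segmentation (UAX #29 plus per-script rules);
    // ICUFoldingFilter then applies Unicode normalization and case/diacritic folding.
    public class IcuBasedAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new ICUTokenizer(reader);
        TokenStream stream = new ICUFoldingFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
      }
    }

The class name and the choice of filters here are mine, not a recommendation; the module also has ICUNormalizer2Filter and ICUTransformFilter if you need something different.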
- Steve

On Jan 8, 2013, at 6:43 PM, Trejkaz <trej...@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantosh...@gmail.com> wrote:
>> Does Lucene StandardAnalyzer work for all languages for tokenizing
>> before indexing (since we are using Java, I think the content is
>> converted to UTF-8 before tokenizing/indexing)?
>
> No. There are multiple cases where it chooses not to break something
> which it should break. Some of these cases even result in undesirable
> behaviour for English, so I would be surprised if there were even a
> single language which it handles acceptably.
>
> It does follow "Unicode standards" for how to tokenise text, but these
> standards were written by people who didn't quite know what they were
> doing, so it's really just passing the buck. I don't think Lucene
> should have chosen to follow that standard in the first place, because
> it rarely (if ever) gives acceptable results.
>
> The worst examples for English, at least for us, were that it does not
> break on colon (:) or underscore (_).
>
> Colon was explained by some languages using it like an apostrophe.
> Personally I think you should break on an apostrophe as well, so I'm
> not really happy with this reasoning, but OK.
>
> Underscore was completely baffling to me, so I asked someone at Unicode
> about it. They explained that it was because it was "used by
> programmers to separate words in identifiers". This explanation is
> exactly as stupid as it sounds and I hope they will realise their
> stupidity some day.
>
>> or do we need to use special analyzers for each of the languages?
>
> I do think that StandardTokenizer at least can form a good base for an
> analyser. You just have to add a ton of filters to fix each additional
> case you find where people don't like it. For instance, it returns
> runs of Katakana as a single token, but if you did that, people
> wouldn't find what they are searching for, so you make a filter to
> split that back out into multiple tokens.
>
> It would help if there were a single, core-maintained analyser for
> "StandardAnalyzer with all the things people hate fixed"... but I
> don't know if anyone is interested in maintaining it.
>
>> In this case, if a document has mixed content (English +
>> Japanese), what analyzer should we use and how can we figure it out
>> dynamically before indexing?
>
> Some language detection libraries will give you back the fragments in
> the text and tell you which language is used for each fragment, so
> that is totally a viable option as well. You'd just make your own
> analyser which concatenates the results.
>
>> Also, while searching, if the query text contains both English and
>> Japanese, how does this work? Any criteria for choosing the analyzers?
>
> I guess you could either ask the user what language they're searching
> in, or look at what characters are in their query and decide which
> language(s) it matches and build the query from there. It might match
> multiple...
>
> TX
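For the "StandardTokenizer as a base plus filters" approach described above, a rough sketch (untested, Lucene 4.x APIs; WordDelimiterFilter from analyzers-common is one existing way to re-split tokens containing underscores, colons and the like, though the Katakana case would still need a custom filter):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: StandardTokenizer as the base, with extra filters bolted on to undo
    // the tokenization decisions you disagree with. WordDelimiterFilter splits on
    // any non-alphanumeric character inside a token (underscore, colon, ...).
    public class PatchedStandardAnalyzer extends Analyzer {
      private static final Version MATCH_VERSION = Version.LUCENE_40;

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(MATCH_VERSION, reader);
        TokenStream stream = new WordDelimiterFilter(tokenizer,
            WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                | WordDelimiterFilter.PRESERVE_ORIGINAL,
            null);                                   // no protected words
        stream = new LowerCaseFilter(MATCH_VERSION, stream);
        return new TokenStreamComponents(tokenizer, stream);
      }
    }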
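And for the last point, deciding from the characters in the query which language(s) it matches, a minimal sketch (untested; CJKAnalyzer and EnglishAnalyzer are stand-ins for whatever analyzers fit your setup, and a real language-detection library would be more robust than this character check):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.util.Version;

    // Sketch: choose a query-time analyzer by inspecting the query's characters.
    // If any Japanese script is present, use a CJK-aware analyzer; otherwise English.
    public final class QueryAnalyzerChooser {
      private static final Version MATCH_VERSION = Version.LUCENE_40;

      public static Analyzer choose(String queryText) {
        return containsJapanese(queryText)
            ? new CJKAnalyzer(MATCH_VERSION)      // or JapaneseAnalyzer (kuromoji module)
            : new EnglishAnalyzer(MATCH_VERSION);
      }

      private static boolean containsJapanese(String text) {
        for (int i = 0; i < text.length(); ) {
          int cp = text.codePointAt(i);
          Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
          if (block == Character.UnicodeBlock.HIRAGANA
              || block == Character.UnicodeBlock.KATAKANA
              || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return true;
          }
          i += Character.charCount(cp);
        }
        return false;
      }
    }

If the query matches more than one language, you could analyze it with each matching analyzer and OR the resulting queries together.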