On Fri, 31 Aug 2012 13:41:14 -0700, Gabe Luchetta <[email protected]> wrote:
> Mary: Thank you so much for your response earlier this month on the
> tokenization questions. Can I assume that we would not have the same
> issue with other languages, such as Chinese traditional/simplified or
> Korean? Are there any other "Easter Eggs" regarding language we should
> be aware of?
>
> Thank you,
>
> Gabe

In theory you could see it with Korean, but it is less of an issue in
practice than it is for Japanese, because Korean use of the shared CJK
characters is much more limited than it is in Japanese.

The other issue to be aware of for languages and search is that stemming
is case- and diacritic-sensitive. In German, where the basic form of a
noun is capitalized, the stem of the lowercased noun will probably not
match the stem of the properly capitalized noun. This can be an issue if
you do case-insensitive searches, because then we are looking at the
stems of the lowercase forms, which will probably just be the whole word
itself. So a case-insensitive stemmed search in German is going to lose
on recall. The same applies to a diacritic-insensitive search in
languages that care about accents, such as French. When you do stemmed
searches in these languages, you should explicitly set them to
case-/diacritic-sensitive, or add the sensitive search as an alternative.

You don't see this in English because in English the proper form of
words has neither uppercase nor diacritics (in general, or the form
without diacritics is an accepted alternative).

//Mary

> Gabe Luchetta
> Product Management
> Catalyst Repository Systems, Inc.
> 1860 Blake Street, Ste. 700
> Denver, CO 80202
> P: 303.824.0820
> C: 720.339.5085
> E: [email protected]<mailto:[email protected]>
> W: www.catalystsecure.com<http://www.catalystsecure.com>
>
> Powering Complex Legal Matters
>
> On Fri, Aug 17, 2012 at 11:24 AM, Gabe Luchetta
> <[email protected]<mailto:[email protected]>>
> wrote:
> I have been assigned to testing the use of non-English languages for our
> software that uses ML and have some questions about tokenization.
>
> According to the Search Developer's
> Guide <http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>,
> "Asian or Middle Eastern characters will tokenize in a language
> appropriate to the character set, even when they occur in elements that
> are not in their language."
>
> During my testing I have found that if I tokenize a mixed
> English/Japanese document using English as the tokenized language, it
> DOES tokenize the Japanese, but I get different tokens than I do when I
> process the same document using Japanese as the tokenized language. I
> assume this is because tokens within the detected character set are
> shared between multiple Asian languages, or that it is relying on
> simpler segmentation methods instead of really tokenizing the text, but
> would like to have some more detail so that we can properly explain this
> to our clients.
>
> Since we are using the built-in language detection to identify languages
> at document level, this is proving to be problematic. If a document only
> has a bit of Japanese in it, the Japanese score returned will be lower
> than the English score, and we will likely mark the document as English.
> If a user then attempts to search the Japanese content using Japanese as
> the language option in the search, they won't get a hit on this
> document. They will only get a hit if they construct their search the
> same way it was tokenized and select English as the search option.
>
> I know this is a complex topic, but would appreciate whatever guidance
> you could provide.
>
> Thank you,
>
> Gabe
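
Mary's point in this thread about German nouns can be sketched in a few lines of Python. This is only a toy illustration and not MarkLogic's stemmer or API: the `STEMS` table, the function names, and the sample words are all assumptions made up for the example. It assumes a stemmer that only recognizes the properly capitalized form of a noun and returns any unknown word unchanged, which is the behavior described in the reply above.

```python
# Toy, case-sensitive stem table (hypothetical; a real stemmer uses a full
# dictionary). Keys are properly capitalized German noun forms.
STEMS = {"Kind": "Kind", "Kinder": "Kind", "Kindern": "Kind"}


def stem(word: str) -> str:
    """Return the stem of an exactly matching form, else the word itself."""
    return STEMS.get(word, word)


def matches_sensitive(query: str, token: str) -> bool:
    """Case-sensitive stemmed match: stem both words as written."""
    return stem(query) == stem(token)


def matches_insensitive(query: str, token: str) -> bool:
    """Case-insensitive stemmed match: lowercase first, then stem.

    The lowercased forms are not in the table, so each word stems to
    itself and only identical words match.
    """
    return stem(query.lower()) == stem(token.lower())


def matches_either(query: str, token: str) -> bool:
    """Case-sensitive stemmed match, with the insensitive one as an alternative."""
    return matches_sensitive(query, token) or matches_insensitive(query, token)


if __name__ == "__main__":
    print(matches_sensitive("Kind", "Kinder"))    # stems agree
    print(matches_insensitive("Kind", "Kinder"))  # recall lost after lowercasing
    print(matches_either("kind", "Kind"))         # alternative still catches this
```

In this sketch the case-insensitive path still finds exact words typed in any case, but it misses inflected forms; combining both paths corresponds to the "add the sensitive search as an alternative" suggestion.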
