Mary:

Thank you so much for your response earlier this month on the tokenization questions. Can I assume that we would not have the same issue with other languages, such as Traditional/Simplified Chinese or Korean? Are there any other "Easter Eggs" regarding language that we should be aware of?

Thank you,
Gabe

*Gabe Luchetta*
Product Management
Catalyst Repository Systems, Inc.
1860 Blake Street, Ste. 700
Denver, CO 80202
P: 303.824.0820
C: 720.339.5085
E: [email protected]
W: www.catalystsecure.com
*Powering Complex Legal Matters*

On Fri, Aug 17, 2012 at 11:24 AM, Gabe Luchetta <[email protected]> wrote:
> I have been assigned to test the use of non-English languages in our
> software that uses ML, and I have some questions about tokenization.
>
> According to the Search Developer's Guide
> <http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>,
> "Asian or Middle Eastern characters will tokenize in a language appropriate
> to the character set, even when they occur in elements that are not in
> their language."
>
> During my testing I have found that if I tokenize a mixed English/Japanese
> document using English as the tokenization language, it DOES tokenize the
> Japanese, but I get different tokens than I do when I process the same
> document using Japanese as the tokenization language. I assume this is
> either because tokens within the detected character set are shared between
> multiple Asian languages, or because it is relying on simpler segmentation
> methods instead of really tokenizing the text, but I would like more detail
> so that we can properly explain this to our clients.
>
> Since we are using the built-in language detection to identify languages
> at the document level, this is proving to be problematic. If a document
> has only a bit of Japanese in it, the Japanese score returned will be lower
> than the English score, and we will likely mark the document as English. If
> a user then attempts to search the Japanese content using Japanese as the
> language option in the search, they won't get a hit on this document. They
> will only get a hit if they construct their search the same way the
> document was tokenized and select English as the search option.
>
> I know this is a complex topic, but I would appreciate whatever guidance
> you could provide.
>
> Thank you,
>
> Gabe
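
To make the document-level detection problem in the quoted message concrete, here is a minimal Python sketch of one possible workaround: instead of keeping only the top-scoring language, record every language whose script appears in the document at all. This is not MarkLogic's API and not how its language detection works. The function names, the `min_chars` threshold, and the script-to-language mapping are all hypothetical, and treating any ASCII letters as English is a deliberate oversimplification.

```python
from collections import Counter

# Unicode block ranges (inclusive) for a few scripts. Illustrative only;
# a real implementation would use the full Unicode script property.
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "han": (0x4E00, 0x9FFF),
    "hangul": (0xAC00, 0xD7AF),
}


def count_scripts(text):
    """Count characters per script; ASCII letters are counted as 'latin'."""
    counts = Counter()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            counts["latin"] += 1
            continue
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def languages_to_index(text, min_chars=1):
    """Return every language hinted at by the scripts present in the text.

    Hypothetical mapping: latin -> en (an assumption), kana -> ja,
    hangul -> ko. Han characters without kana cannot be attributed to one
    language by script alone, so they are reported as 'zh-or-ja'.
    """
    counts = count_scripts(text)
    langs = set()
    if counts["latin"] >= min_chars:
        langs.add("en")
    if counts["hiragana"] + counts["katakana"] >= min_chars:
        langs.add("ja")
    if counts["hangul"] >= min_chars:
        langs.add("ko")
    if counts["han"] >= min_chars and "ja" not in langs:
        langs.add("zh-or-ja")
    return sorted(langs)


# A mostly English document with a little Japanese is tagged with both.
print(languages_to_index("This report mentions 東京タワー once."))
```

With a tag set like this, a document that is mostly English would still be marked as containing Japanese, so it could be processed or searched under both languages. Whether that fits depends on how the application stores language metadata, which the thread does not specify.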
_______________________________________________ General mailing list [email protected] http://developer.marklogic.com/mailman/listinfo/general
