Mary,

Just wondering: are those language contexts at the element level or the fragment level? E.g. if you have one doc fragment with lots of paragraphs, each with its own xml:lang, will they be treated differently?

Kind regards,
Geert

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Friday, 17 August 2012 19:53
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Tokenization Questions

On Fri, 17 Aug 2012 10:24:06 -0700, Gabe Luchetta <[email protected]> wrote:

> I have been assigned to testing the use of non-English languages for our
> software that uses ML and have some questions about tokenization.
>
> According to the Search Developer's
> Guide <http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>,
> "Asian or Middle Eastern characters will tokenize in a language
> appropriate to the character set, even when they occur in elements that
> are not in their language."
>
> During my testing I have found that if I tokenize a mixed
> English/Japanese document using English as the tokenized language, it
> DOES tokenize the Japanese, but I get different tokens than I do when I
> process the same document using Japanese as the tokenized language. I
> assume this is because tokens within the detected character set are
> shared between multiple Asian languages, or that it is relying on
> simpler segmentation methods instead of really tokenizing the text, but
> would like to have some more detail so that we can properly explain this
> to our clients.
>
> Since we are using the built-in language detection to identify languages
> at document level, this is proving to be problematic. If a document only
> has a bit of Japanese in it, the Japanese score returned will be lower
> than the English score, and we will likely mark the document as English.
> If a user then attempts to search the Japanese content using Japanese as
> the language option in the search, they won't get a hit on this
> document. They will only get a hit if they construct their search the
> same way it was tokenized and select English as the search option.
>
> I know this is a complex topic, but would appreciate whatever guidance
> you could provide.

The way tokenization works is that we look for runs of text in some
particular language. If we only have the script to go on, rather than
an explicit language identifier (e.g. xml:lang="ja"), then we use some
basic rules to make a guess. If the current language context is English
and the character we are looking at is a CJK character, the default
assumption is Chinese. If it were a Japanese-only character, the default
assumption would be Japanese. If we were in a Japanese language context
and we saw a CJK character, the assumption would be that we're still
looking at Japanese.

So that is probably why you are seeing the difference. This is
especially an issue with Japanese, which uses multiple scripts, and
where some words will start with CJK characters. The tokenizer doesn't
look ahead to realize that some Japanese characters are coming up and
that Chinese might be a bad guess.

//Mary

[email protected]
Principal Engineer
MarkLogic Corporation

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
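
For illustration only, the guessing rules described in Mary's reply can be sketched in a few lines of Python. This is not MarkLogic code, and it does not use any MarkLogic API. The function names, the language codes, and the Unicode ranges used for "CJK" (the basic Han ideograph block) and "Japanese-only" (hiragana and katakana) are assumptions made for the sketch. It also assumes the language context stays fixed at the declared language for the whole string, which the reply does not state.

```python
from itertools import groupby


def is_kana(ch):
    # Assumption: "Japanese-only character" means hiragana or katakana
    # (U+3040 through U+30FF).
    return "\u3040" <= ch <= "\u30ff"


def is_han(ch):
    # Assumption: "CJK character" means the basic CJK Unified Ideographs
    # block (U+4E00 through U+9FFF).
    return "\u4e00" <= ch <= "\u9fff"


def guess_language(ch, context_lang):
    """Guess a language for one character, with no lookahead."""
    if is_kana(ch):
        return "ja"  # Japanese-only character: assume Japanese
    if is_han(ch):
        # Shared CJK character: stay Japanese in a Japanese context,
        # otherwise default to Chinese.
        return "ja" if context_lang == "ja" else "zh"
    return context_lang  # anything else keeps the declared context


def language_runs(text, context_lang):
    """Split text into (language, run) pairs of consecutive characters."""
    key = lambda ch: guess_language(ch, context_lang)
    return [(lang, "".join(chars)) for lang, chars in groupby(text, key=key)]


if __name__ == "__main__":
    sentence = "寿司を食べた"
    print(language_runs(sentence, "en"))
    print(language_runs(sentence, "ja"))
```

Under these assumptions, the same Japanese sentence is cut into alternating Chinese and Japanese runs in an English context, but stays a single Japanese run in a Japanese context, which is consistent with the different tokens Gabe reports.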
