I have been assigned to test non-English language support in our software, which uses MarkLogic (ML), and I have some questions about tokenization.
According to the Search Developer's Guide<http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>, "Asian or Middle Eastern characters will tokenize in a language appropriate to the character set, even when they occur in elements that are not in their language."

During my testing I have found that if I tokenize a mixed English/Japanese document using English as the tokenization language, it DOES tokenize the Japanese, but I get different tokens than I do when I process the same document using Japanese as the tokenization language. I assume this is because tokens within the detected character set are shared between multiple Asian languages, or because it relies on simpler segmentation methods instead of really tokenizing the text, but I would like more detail so that we can properly explain this to our clients.

Since we are using the built-in language detection to identify languages at the document level, this is proving to be problematic. If a document has only a bit of Japanese in it, the Japanese score returned will be lower than the English score, and we will likely mark the document as English. If a user then attempts to search the Japanese content using Japanese as the language option in the search, they won't get a hit on this document. They will only get a hit if they construct their search the same way the content was tokenized and select English as the search option.

I know this is a complex topic, but I would appreciate whatever guidance you can provide.

Thank you,
Gabe
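
P.S. To make the document-level labelling problem concrete, here is a toy sketch in Python. It is NOT how MarkLogic detects languages; it only assumes that each language gets a score and that the whole document is labelled with the highest-scoring one. The helper names (script_of, language_scores, document_language) and the sample text are made up for illustration.

```python
# Toy illustration only: this is NOT MarkLogic's language detection.
# It scores a document by the share of letters in each script and
# labels the whole document with the highest-scoring language.
import unicodedata


def script_of(ch):
    """Roughly classify a character as 'ja', 'en', or None (hypothetical helper)."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED")):
        return "ja"
    if name.startswith("LATIN"):
        return "en"
    return None


def language_scores(text):
    """Return the fraction of classified letters belonging to each language."""
    counts = {"en": 0, "ja": 0}
    for ch in text:
        lang = script_of(ch)
        if lang:
            counts[lang] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {lang: n / total for lang, n in counts.items()}


def document_language(text):
    """Label the whole document with its single highest-scoring language."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


# Made-up sample: mostly English with one short Japanese sentence.
doc = "The quarterly report covers sales in Tokyo. 東京の売上は好調でした。"
print(language_scores(doc))
print(document_language(doc))
```

Under these assumptions the sample is labelled "en" even though its Japanese score is non-zero, which is the situation in which a Japanese-language search misses the document.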
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
