Paul McGuire wrote:
> On Aug 26, 10:48 pm, Steven Bethard <[EMAIL PROTECTED]> wrote:
>> In Japanese and Chinese tokenization, word boundaries are not marked
>> by different classes of characters. They only exist in the mind of
>> the reader who knows which sequences of characters could be words
>> given the context, and which sequences of characters couldn't.
>>
>> The closest analog would be to ask pyparsing to find the words in
>> the following sentence:
>>
>> ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
>>
>> Most approaches that have been even marginally successful on these
>> kinds of tasks have used statistical machine learning approaches.
>
> You mean like this?
>
> from pyparsing import *
>
> knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
>     'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
>     'that', 'in', 'python', 'library', 'provides', 'code', 'to']
>
> knownWord = oneOf( knownWords, caseless=True )
> sentence = OneOrMore( knownWord ) + "."
>
> mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."
>
> print sentence.parseString( mush )
>
> prints:
>
> ['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
> 'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
> 'the', 'grammar', 'directly', 'in', 'python', 'code', '.']
>
> In fact, this is almost the exact scheme used by Zhpy for extracting
> Chinese versions of Python keywords and mapping them back to
> English/Latin words. Of course, this is not practical for natural
> language processing, as the vocabulary gets too large. And you can
> get ambiguous matches, such as a vocabulary containing the words
> ['in', 'to', 'into'] - the run-together "into" will always be
> assumed to be "into", and never "in to".
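To make that last failure mode concrete, here is a minimal sketch along
the same lines (the vocabulary is invented purely to trigger the
ambiguity; oneOf always prefers its longest matching alternative):

from pyparsing import *

# Invented vocabulary containing both 'into' and its parts:
vocab = oneOf(['they', 'trade', 'in', 'to', 'into', 'tokyo'],
              caseless=True)
sentence = OneOrMore(vocab) + "."

# The intended reading is 'they trade in tokyo', but oneOf matches
# its longest alternative first, so 'into' swallows 'in' plus the
# start of 'tokyo', leaving an unmatchable 'kyo':
print sentence.parseString("theytradeintokyo.")
# raises ParseException instead of finding
# ['they', 'trade', 'in', 'tokyo', '.']

Because pyparsing never backtracks into an already-matched OneOrMore,
the dead-end greedy reading is never revisited.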
Yep, and these kinds of things occur quite frequently with Chinese and
Japanese. The point was not that pyparsing couldn't do it for a small
subset of characters/words, but that pyparsing is probably not the
right solution for general-purpose Japanese/Chinese tokenization.

Steve
--
http://mail.python.org/mailman/listinfo/python-list
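For contrast, here is a rough sketch of the kind of statistical
approach that does generalize: a unigram segmenter that scores every
possible split with dynamic programming. The probabilities below are
invented for illustration; a real system would estimate them from a
large segmented corpus.

import math

# Hypothetical unigram probabilities (invented for this example):
probs = {'they': 0.1, 'trade': 0.05, 'in': 0.2, 'to': 0.2,
         'into': 0.04, 'tokyo': 0.01}

def segment(text, max_len=8):
    # best[i] holds (log-probability, words) for the best segmentation
    # of text[:i]; dynamic programming over all split points.
    best = [(0.0, [])] + [(float('-inf'), None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in probs and best[j][1] is not None:
                score = best[j][0] + math.log(probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[-1][1]

print segment('theytradeintokyo')
# -> ['they', 'trade', 'in', 'tokyo']; the greedy 'into' reading
#    dead-ends at 'kyo', but the dynamic program considers every
#    split point, so the intended reading is still found.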