rzo1 commented on PR #559: URL: https://github.com/apache/opennlp/pull/559#issuecomment-1846694280
> Is there a spec for this behavior?

The Penn Treebank guidelines suggest tokenizing these as `ca` + `n't` and `do` + `n't`. The Python guys in [NLTK](https://www.nltk.org/_modules/nltk/tokenize/treebank.html) adhere to this convention (if the Penn Treebank tokenizer is used).

Another example is the English phrase *a 12-ft boat*. How shall we handle the hyphenated length expression? Is this one, two, or even three tokens? From a very quick literature review it seems that this ambiguity is an implementation detail and not really defined (as it depends on the actual use case).

Looking at the [Stanford Tokenizer](https://stanfordnlp.github.io/CoreNLP/tokenize.html), they have a bunch of configuration options for a lot of the normalization that happens during tokenizing.
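For reference, here is a quick NLTK sketch (not part of this PR) showing how its Penn Treebank tokenizer handles both examples; the outputs in the comments are what `TreebankWordTokenizer` produces, to the best of my knowledge:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Contractions are split per the Penn Treebank guidelines:
print(tokenizer.tokenize("I can't do it"))
# ['I', 'ca', "n't", 'do', 'it']

# The hyphenated length expression stays a single token here:
print(tokenizer.tokenize("a 12-ft boat"))
# ['a', '12-ft', 'boat']
```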
