[ https://issues.apache.org/jira/browse/OPENNLP-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607527#comment-13607527 ]
Joern Kottmann edited comment on OPENNLP-546 at 1/11/17 12:32 PM:
------------------------------------------------------------------
The current implementation of the cTAKES PTB tokenizer outputs newline tokens,
but the OpenNLP tokenizers don't support this yet.
There are two ways to support it:
- Output only the regular tokens and add the newline tokens in a second pass,
e.g. with a UIMA AE
- Extend the OpenNLP tokenizer slightly to support layout tags (e.g. a
<NEWLINE> token, or a span with that as its type)
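
To make the first option concrete, here is a minimal sketch of such a second pass in plain Java. It is an illustration only, not the cTAKES or OpenNLP implementation: the class NewlineTokenPass, the TypedSpan holder and the "NEWLINE" type label are hypothetical names, and the sketch assumes the token spans arrive sorted by start offset and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP or cTAKES.
final class NewlineTokenPass {

    /** Character offsets [start, end) plus a type label (illustrative stand-in for a token span). */
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Second pass: walks the gaps between the given tokens and emits one
     * "NEWLINE" span for every '\n' character found there.
     * Assumes tokens are sorted by start offset and do not overlap.
     */
    static List<TypedSpan> addNewlineTokens(String text, List<TypedSpan> tokens) {
        List<TypedSpan> merged = new ArrayList<>();
        int pos = 0;
        for (TypedSpan token : tokens) {
            addNewlinesBetween(text, pos, token.start, merged);
            merged.add(token);
            pos = token.end;
        }
        // Trailing newlines after the last token.
        addNewlinesBetween(text, pos, text.length(), merged);
        return merged;
    }

    private static void addNewlinesBetween(String text, int from, int to, List<TypedSpan> out) {
        for (int i = from; i < to; i++) {
            if (text.charAt(i) == '\n') {
                out.add(new TypedSpan(i, i + 1, "NEWLINE"));
            }
        }
    }
}
```

Because the pass only inspects the text between tokens, the tokenizer itself stays unchanged; the same loop could sit inside a UIMA AE that reads the token annotations and adds newline annotations. Only '\n' is marked here, so "\r\n" input yields one newline span per line break.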
was (Author: joern):
The current implementation of the PTB tokenizer outputs newline tokens, but the
OpenNLP tokenizers don't support this yet.
There are two ways of supporting this:
- Only output the tokens without newline tokens and add the newline tokens in a
second run, e.g. by a UIMA AE
- Extend the OpenNLP tokenizer a bit and support layout tags (e.g. <NEWLINE>,
or a span with this as the type)
> Add TokenizerPTB
> ----------------
>
> Key: OPENNLP-546
> URL: https://issues.apache.org/jira/browse/OPENNLP-546
> Project: OpenNLP
> Issue Type: New Feature
> Components: Tokenizer
> Reporter: Pei Chen
> Priority: Minor
> Labels: help-wanted
>
> Add Tokenizer based on Penn Tree Bank rules.