[
https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tristan Nixon updated OPENNLP-857:
----------------------------------
Attachment: ParserToolTokenize.patch
My patch
> ParserTool should use a Tokenizer instance. It should not use
> java.util.StringTokenizer
> ------------------------------------------------------------------------------------------
>
> Key: OPENNLP-857
> URL: https://issues.apache.org/jira/browse/OPENNLP-857
> Project: OpenNLP
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.6.0
> Reporter: Tristan Nixon
> Attachments: ParserToolTokenize.patch
>
>
> It would be nice if the ParserTool would make use of a real tokenizer. In
> addition to being the "right" thing to do, it would obviate issues like
> OPENNLP-240 when using the parser tool.
> While I realize that java.util.StringTokenizer effectively does the same work
> as WhitespaceTokenizer, it seems odd to use the former when the latter exists.
> To this end, I'm attaching a patch that adds an additional method:
>     public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses)
> I've left the existing method
>     public static Parse[] parseLine(String line, Parser parser, int numParses)
> in place for convenience and backwards compatibility. It simply calls the new
> method with WhitespaceTokenizer.INSTANCE.
> For good measure, I've added a new command-line argument -tk, which takes the
> name of a tokenizer model. If none is specified, it will fall back on the
> current behavior of using the whitespace tokenizer.
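The overload-and-fallback arrangement described above might look roughly like the sketch below. It is not the attached patch and does not use the OpenNLP classes: the Tokenizer interface, the WHITESPACE and PUNCT_AWARE instances, and the tokensFor and chooseTokenizer names are hypothetical stand-ins so that the example compiles on its own. The real parseLine methods return Parse[]; here the methods return only the tokens that would be handed to the parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParseLineSketch {

    /** Hypothetical stand-in for a tokenizer interface (assumption, not the OpenNLP type). */
    interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Stand-in for a whitespace tokenizer singleton: splits on runs of whitespace. */
    static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** Toy stand-in for a model-based tokenizer: also splits off trailing punctuation. */
    static final Tokenizer PUNCT_AWARE = s -> {
        List<String> out = new ArrayList<>();
        for (String tok : WHITESPACE.tokenize(s)) {
            int end = tok.length();
            // Walk back over trailing non-alphanumeric characters, keeping at least one char.
            while (end > 1 && !Character.isLetterOrDigit(tok.charAt(end - 1))) {
                end--;
            }
            out.add(tok.substring(0, end));
            for (int i = end; i < tok.length(); i++) {
                out.add(String.valueOf(tok.charAt(i)));
            }
        }
        return out.toArray(new String[0]);
    };

    /** New-style entry point: the caller supplies the tokenizer. */
    static String[] tokensFor(String line, Tokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    /** Old-style entry point kept for compatibility: delegates with the whitespace default. */
    static String[] tokensFor(String line) {
        return tokensFor(line, WHITESPACE);
    }

    /** Command-line style fallback: use the loaded tokenizer if there is one, else whitespace. */
    static Tokenizer chooseTokenizer(Tokenizer loadedFromModel) {
        return loadedFromModel != null ? loadedFromModel : WHITESPACE;
    }

    public static void main(String[] args) {
        String line = "The dog barked.";
        // No tokenizer supplied: falls back to whitespace splitting.
        System.out.println(Arrays.toString(tokensFor(line, chooseTokenizer(null))));
        // A tokenizer supplied: punctuation becomes its own token.
        System.out.println(Arrays.toString(tokensFor(line, chooseTokenizer(PUNCT_AWARE))));
    }
}
```

Keeping the three-argument form as a thin wrapper means existing callers see no change in behavior, while new callers can pass any tokenizer they like.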