[
https://issues.apache.org/jira/browse/OPENNLP-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480471#comment-16480471
]
ASF GitHub Bot commented on OPENNLP-1189:
-----------------------------------------
jzonthemtn closed pull request #307: OPENNLP-1189: Updating tokenizer input
description.
URL: https://github.com/apache/opennlp/pull/307
This is a PR merged from a forked repository.
Because GitHub hides the original diff of a pull request from a fork once
it is merged, the diff is reproduced below for the sake of provenance:
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml
b/opennlp-docs/src/docbkx/tokenizer.xml
index 3fb451935..f1fae1949 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -215,7 +215,8 @@ double tokenProbs[] = tokenizer.getTokenProbabilities();]]>
available from the model download page on
various corpora. The data
can be converted to the OpenNLP Tokenizer
training format or used directly.
The OpenNLP format contains one sentence per line. Tokens are
either separated by a
- whitespace or by a special <SPLIT> tag.
+ whitespace or by a special <SPLIT> tag. Tokens are split
automaticaly on whitespace
+ and at least one <SPLIT> tag must be present in the
training text.
The following sample shows the sample from
above in the correct format.
<screen>
----------------------------------------------------------------
> Token model creation fails without at least one <SPLIT> tag
> -----------------------------------------------------------
>
> Key: OPENNLP-1189
> URL: https://issues.apache.org/jira/browse/OPENNLP-1189
> Project: OpenNLP
> Issue Type: Bug
> Components: Tokenizer
> Affects Versions: 1.8.4
> Reporter: Jeff Zemerick
> Assignee: Jeff Zemerick
> Priority: Minor
> Fix For: 1.8.5
>
>
> The tokenizer training documentation for 1.8.4 states that "Tokens are either
> separated by a whitespace or by a special <SPLIT> tag." However, it appears
> that training fails if the training data does not contain at least one
> <SPLIT> tag. To reproduce:
> Training on the sample data works fine:
> {quote}Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
> nonexecutive director Nov. 29<SPLIT>.
> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
> group<SPLIT>.
> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold
> Fields PLC<SPLIT>,
> was named a nonexecutive director of this British industrial
> conglomerate<SPLIT>.
> {quote}
> Replacing the <SPLIT> tags with whitespace causes the training to fail with
> InsufficientTrainingDataException:
> {quote}Pierre Vinken , 61 years old , will join the board as a nonexecutive
> director Nov. 29 .
> Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
> Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
> PLC ,
> was named a nonexecutive director of this British industrial conglomerate .
> {quote}
> Modifying the training data to contain a single <SPLIT> tag allows model
> training to complete successfully:
> {quote}Pierre Vinken<SPLIT>, 61 years old , will join the board as a
> nonexecutive director Nov. 29 .
> Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
> Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
> PLC ,
> was named a nonexecutive director of this British industrial conglomerate .
> {quote}
>
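The behaviour described above suggests checking the training data before
starting a training run. The following is a minimal sketch in plain Java,
with no OpenNLP dependency: it mirrors the documented format (tokens split
on whitespace and on <SPLIT> tags) and reports whether at least one <SPLIT>
tag is present. The class and method names (SplitFormatCheck, tokens,
hasSplitTag) are hypothetical and are not part of the OpenNLP API; the
sketch does not reproduce the trainer's internal logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitFormatCheck {

    // Tag that marks a token boundary with no whitespace in the original text.
    static final String SPLIT_TAG = "<SPLIT>";

    // Splits one training line into tokens: first on whitespace,
    // then on <SPLIT> tags inside each whitespace-delimited chunk.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        for (String chunk : line.trim().split("\\s+")) {
            for (String token : chunk.split(Pattern.quote(SPLIT_TAG))) {
                if (!token.isEmpty()) {
                    result.add(token);
                }
            }
        }
        return result;
    }

    // Returns true if at least one training line contains a <SPLIT> tag.
    static boolean hasSplitTag(List<String> lines) {
        for (String line : lines) {
            if (line.contains(SPLIT_TAG)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> whitespaceOnly = Arrays.asList(
                "Pierre Vinken , 61 years old , will join the board .",
                "Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .");
        List<String> withOneTag = Arrays.asList(
                "Pierre Vinken<SPLIT>, 61 years old , will join the board .",
                "Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .");

        System.out.println("whitespace only has tag: " + hasSplitTag(whitespaceOnly));
        System.out.println("one tag has tag: " + hasSplitTag(withOneTag));
        System.out.println(tokens(withOneTag.get(0)));
    }
}
```

A check like hasSplitTag could be run over the training file first, so that
a missing tag is reported with a clear message before training is attempted.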