[
https://issues.apache.org/jira/browse/OPENNLP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291638#comment-16291638
]
ASF GitHub Bot commented on OPENNLP-1166:
-----------------------------------------
kottmann commented on a change in pull request #294: OPENNLP-1166: TwoPassDataIndexer fails if features contain \n
URL: https://github.com/apache/opennlp/pull/294#discussion_r157066030
##########
File path: opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
##########
@@ -59,20 +62,28 @@ public void index(ObjectStream<Event> eventStream) throws IOException {
File tmp = File.createTempFile("events", null);
tmp.deleteOnExit();
int numEvents;
- try (Writer osw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(tmp),
- StandardCharsets.UTF_8))) {
- numEvents = computeEventCounts(eventStream, osw, predicateIndex, cutoff);
+ BigInteger writeHash;
+ HashSumEventStream writeEventStream = new HashSumEventStream(eventStream); // do not close.
+ try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) {
+ numEvents = computeEventCounts(writeEventStream, dos, predicateIndex, cutoff);
}
+ writeHash = writeEventStream.calculateHashSum();
+
display("done. " + numEvents + " events\n");
display("\tIndexing... ");
List<ComparableEvent> eventsToCompare;
- try (FileEventStream fes = new FileEventStream(tmp)) {
- eventsToCompare = index(fes, predicateIndex);
+ BigInteger readHash = null;
+ try (HashSumEventStream readStream = new HashSumEventStream(new EventStream(tmp))) {
+ eventsToCompare = index(readStream, predicateIndex);
+ readHash = readStream.calculateHashSum();
}
-
tmp.delete();
+
+ if (readHash.compareTo(writeHash) != 0)
+ throw new RuntimeException("Event hash for writing and reading events did not match.");
Review comment:
The cause of this exception will be a problem writing to or reading from disk.
Why is IOException not the right one here?
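
To make the suggestion concrete, here is a minimal sketch of the write/read round trip with the mismatch reported as an IOException. It is not the code from the pull request: the class EventRoundTrip and its methods are hypothetical, events are reduced to plain feature strings, and a CRC32 checksum stands in for HashSumEventStream. The features are written with DataOutputStream.writeUTF, which is length-prefixed, so a feature containing \n is stored as ordinary data.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration; not part of OpenNLP.
public class EventRoundTrip {

  // Writes the features in a length-prefixed binary form and returns a
  // checksum over what was written (CRC32 is a stand-in for the hash sum).
  public static long write(File file, List<String> features) throws IOException {
    CRC32 crc = new CRC32();
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)))) {
      out.writeInt(features.size());
      for (String feature : features) {
        out.writeUTF(feature); // "\n" inside a feature is just data here
        crc.update(feature.getBytes(StandardCharsets.UTF_8));
      }
    }
    return crc.getValue();
  }

  // Reads the features back and fails with an IOException if the checksum
  // differs from the one computed while writing.
  public static List<String> readVerified(File file, long expectedChecksum)
      throws IOException {
    CRC32 crc = new CRC32();
    List<String> features = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file)))) {
      int count = in.readInt();
      for (int i = 0; i < count; i++) {
        String feature = in.readUTF();
        features.add(feature);
        crc.update(feature.getBytes(StandardCharsets.UTF_8));
      }
    }
    if (crc.getValue() != expectedChecksum) {
      throw new IOException("Event checksum for writing and reading events did not match.");
    }
    return features;
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("events", null);
    tmp.deleteOnExit();
    List<String> features = Arrays.asList("w=\n", "prev=foo");
    long checksum = write(tmp, features);
    // Prints whether the features survived the round trip unchanged.
    System.out.println(features.equals(readVerified(tmp, checksum)));
  }
}
```

Since index(...) already declares throws IOException, callers would not need a new catch clause for the mismatch case.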
> TwoPassDataIndexer fails if features contain \n
> -----------------------------------------------
>
> Key: OPENNLP-1166
> URL: https://issues.apache.org/jira/browse/OPENNLP-1166
> Project: OpenNLP
> Issue Type: Improvement
> Components: Machine Learning
> Affects Versions: 1.8.3
> Reporter: Peter Thygesen
> Assignee: Peter Thygesen
>
> Training a model with newline tokens causes TwoPassDataIndexer to throw an
> exception:
> Exception in thread "main" java.util.NoSuchElementException
> at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
> at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:71)
> at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:35)
> at opennlp.tools.ml.model.AbstractDataIndexer.index(AbstractDataIndexer.java:168)
> at opennlp.tools.ml.model.TwoPassDataIndexer.index(TwoPassDataIndexer.java:72)
> at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:68)
> at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:90)
> at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:244)
> at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:169)
> at opennlp.tools.cmdline.CLI.main(CLI.java:256)
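
The trace above is consistent with a line-oriented temporary file in which a newline inside a feature splits one event across lines. The sketch below reproduces that failure mode under a simplified, assumed format ("outcome feature ..." followed by a line break); NewlineFeatureDemo and its methods are hypothetical and are not the FileEventStream implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

// Hypothetical illustration; not part of OpenNLP.
public class NewlineFeatureDemo {

  // Assumed format: outcome and features separated by spaces, one event per line.
  public static String serialize(String outcome, String... features) {
    return outcome + " " + String.join(" ", features) + "\n";
  }

  // Returns true if some line has no outcome token, which is the situation
  // in which StringTokenizer.nextToken() throws NoSuchElementException.
  public static boolean parseFails(String data) throws IOException {
    try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
      String line;
      while ((line = reader.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(line);
        try {
          st.nextToken(); // the outcome; an empty line has no tokens
        } catch (NoSuchElementException e) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    // A feature ending in "\n" leaves an empty line behind in the file.
    System.out.println(parseFails(serialize("person-start", "w=\n")));
    System.out.println(parseFails(serialize("person-start", "w=Peter")));
  }
}
```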