[ https://issues.apache.org/jira/browse/OPENNLP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291641#comment-16291641 ]

ASF GitHub Bot commented on OPENNLP-1166:
-----------------------------------------

autayeu commented on a change in pull request #294: OPENNLP-1166: TwoPassDataIndexer fails if features contain \n
URL: https://github.com/apache/opennlp/pull/294#discussion_r157066629
 
 

 ##########
 File path: opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
 ##########
 @@ -59,20 +62,28 @@ public void index(ObjectStream<Event> eventStream) throws IOException {
     File tmp = File.createTempFile("events", null);
     tmp.deleteOnExit();
     int numEvents;
-    try (Writer osw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(tmp),
-        StandardCharsets.UTF_8))) {
-      numEvents = computeEventCounts(eventStream, osw, predicateIndex, cutoff);
+    BigInteger writeHash;
+    HashSumEventStream writeEventStream = new HashSumEventStream(eventStream);  // do not close.
+    try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) {
+      numEvents = computeEventCounts(writeEventStream, dos, predicateIndex, cutoff);
     }
+    writeHash = writeEventStream.calculateHashSum();
+
     display("done. " + numEvents + " events\n");
 
     display("\tIndexing...  ");
 
     List<ComparableEvent> eventsToCompare;
-    try (FileEventStream fes = new FileEventStream(tmp)) {
-      eventsToCompare = index(fes, predicateIndex);
+    BigInteger readHash = null;
+    try (HashSumEventStream readStream = new HashSumEventStream(new EventStream(tmp))) {
+      eventsToCompare = index(readStream, predicateIndex);
+      readHash = readStream.calculateHashSum();
     }
-
     tmp.delete();
+
+    if (readHash.compareTo(writeHash) != 0)
+      throw new RuntimeException("Event hash for writing and reading events did not match.");
 
 Review comment:
   Got it. Thank you.
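
For readers following the diff above: the change wraps both the event stream written to the temp file and the stream read back from it, then compares the two hash sums. The following is a minimal sketch of that round-trip check, not the actual HashSumEventStream code. The names HashingIterator and drain, the use of plain strings in place of Event objects, and the choice of MD5 with a zero-byte separator are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class HashingIteratorDemo {

  // Hypothetical wrapper: passes items through unchanged while digesting them.
  static final class HashingIterator implements Iterator<String> {
    private final Iterator<String> delegate;
    private final MessageDigest digest;

    HashingIterator(Iterator<String> delegate) {
      this.delegate = delegate;
      try {
        this.digest = MessageDigest.getInstance("MD5");
      } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException(e);
      }
    }

    @Override
    public boolean hasNext() {
      return delegate.hasNext();
    }

    @Override
    public String next() {
      String item = delegate.next();
      digest.update(item.getBytes(StandardCharsets.UTF_8));
      digest.update((byte) 0); // separator so "ab","c" differs from "a","bc"
      return item;
    }

    // Call once, after the stream has been fully consumed.
    BigInteger hashSum() {
      return new BigInteger(1, digest.digest());
    }
  }

  // Consumes every item and returns the hash sum of what was seen.
  static BigInteger drain(Iterator<String> items) {
    HashingIterator it = new HashingIterator(items);
    while (it.hasNext()) {
      it.next();
    }
    return it.hashSum();
  }

  public static void main(String[] args) {
    List<String> written = Arrays.asList("other w=a", "person w=b");
    List<String> readBack = Arrays.asList("other w=a", "person w=b");
    List<String> truncated = Arrays.asList("other w=a");

    // Same events in the same order give the same sum.
    System.out.println(drain(written.iterator()).equals(drain(readBack.iterator())));
    // A lost event changes the sum, which is what the check is meant to catch.
    System.out.println(drain(written.iterator()).equals(drain(truncated.iterator())));
  }
}
```

In this sketch the hash is only meaningful once the wrapped iterator is exhausted, which matches the diff computing the sum after computeEventCounts and index have finished.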

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> TwoPassDataIndexer fails if features contain \n
> -----------------------------------------------
>
>                 Key: OPENNLP-1166
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1166
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning
>    Affects Versions: 1.8.3
>            Reporter: Peter Thygesen
>            Assignee: Peter Thygesen
>
> Training a model with newline tokens causes TwoPassDataIndexer to throw an exception:
> Exception in thread "main" java.util.NoSuchElementException
>     at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
>     at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:71)
>     at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:35)
>     at opennlp.tools.ml.model.AbstractDataIndexer.index(AbstractDataIndexer.java:168)
>     at opennlp.tools.ml.model.TwoPassDataIndexer.index(TwoPassDataIndexer.java:72)
>     at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:68)
>     at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:90)
>     at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:244)
>     at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:169)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:256)
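
The failure described above can be illustrated with a small sketch. It assumes a simplified line-based text format ("outcome feature1 feature2 ...", one event per line), which may differ from the real temp-file layout, and all class and method names here are hypothetical. A feature containing \n splits one event across two lines, and tokenizing the resulting empty line is one plausible way to reach the NoSuchElementException in the trace. A length-prefixed binary encoding via DataOutputStream.writeUTF, as used in the pull request, round-trips the same feature unchanged.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class NewlineFeatureDemo {

  // Assumed text format: outcome and features separated by spaces, one event per line.
  static String writeAsLine(String outcome, String[] features) {
    return outcome + " " + String.join(" ", features) + "\n";
  }

  // Counts the records a line-based reader would see in the text.
  static int countLines(String text) {
    int n = 0;
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      while (reader.readLine() != null) {
        n++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return n;
  }

  // Binary format: each string is length-prefixed, so its content is never parsed.
  static byte[] writeBinary(String outcome, String[] features) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream dos = new DataOutputStream(bytes)) {
      dos.writeUTF(outcome);
      dos.writeInt(features.length);
      for (String feature : features) {
        dos.writeUTF(feature);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads back the features written by writeBinary.
  static String[] readBinaryFeatures(byte[] data) {
    try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data))) {
      dis.readUTF(); // skip the outcome
      String[] features = new String[dis.readInt()];
      for (int i = 0; i < features.length; i++) {
        features[i] = dis.readUTF();
      }
      return features;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String[] features = {"prev=foo", "w=\n"};

    // One event, but the line-based reader sees two records (the second is empty).
    System.out.println(countLines(writeAsLine("other", features)));

    // The binary round trip preserves the newline inside the feature.
    System.out.println(Arrays.equals(features, readBinaryFeatures(writeBinary("other", features))));
  }
}
```

The design point is that a delimiter-based format has to escape or forbid the delimiter inside values, while a length-prefixed format does not inspect the value at all.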



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
