[ 
https://issues.apache.org/jira/browse/OPENNLP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288331#comment-16288331
 ] 

ASF GitHub Bot commented on OPENNLP-1166:
-----------------------------------------

kottmann commented on a change in pull request #294: OPENNLP-1166: TwoPassDataIndexer fails if features contain \n
URL: https://github.com/apache/opennlp/pull/294#discussion_r156507092
 
 

 ##########
 File path: opennlp-tools/src/main/java/opennlp/tools/ml/model/TwoPassDataIndexer.java
 ##########
 @@ -59,20 +62,28 @@ public void index(ObjectStream<Event> eventStream) throws IOException {
     File tmp = File.createTempFile("events", null);
     tmp.deleteOnExit();
     int numEvents;
-    try (Writer osw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(tmp),
-        StandardCharsets.UTF_8))) {
-      numEvents = computeEventCounts(eventStream, osw, predicateIndex, cutoff);
+    BigInteger writeHash;
+    HashSumEventStream writeEventStream = new HashSumEventStream(eventStream);  // do not close.
+    try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(tmp)))) {
+      numEvents = computeEventCounts(writeEventStream, dos, predicateIndex, cutoff);
     }
+    writeHash = writeEventStream.calculateHashSum();
+
     display("done. " + numEvents + " events\n");
 
     display("\tIndexing...  ");
 
     List<ComparableEvent> eventsToCompare;
-    try (FileEventStream fes = new FileEventStream(tmp)) {
-      eventsToCompare = index(fes, predicateIndex);
+    BigInteger readHash = null;
+    try (HashSumEventStream readStream = new HashSumEventStream(new EventStream(tmp))) {
+      eventsToCompare = index(readStream, predicateIndex);
+      readHash = readStream.calculateHashSum();
     }
-
     tmp.delete();
+
+    if (readHash.compareTo(writeHash) != 0)
+      throw new RuntimeException("Event hash for writing and reading events did not match.");
 
 Review comment:
   It seems nicer to me to throw an IOException here instead; after all, a
hash mismatch probably happens because of I/O issues.
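
   A minimal sketch of what the suggested change could look like, assuming
the check is pulled into a small helper. The class name HashCheckSketch and
the method verifyEventHash are hypothetical and not part of OpenNLP; only the
exception message is taken from the diff above.

```java
import java.io.IOException;
import java.math.BigInteger;

public class HashCheckSketch {

  // Hypothetical helper (not an OpenNLP API): compares the hash sum taken
  // while writing the temporary event file with the one taken while
  // reading it back, and reports a mismatch as an I/O failure.
  public static void verifyEventHash(BigInteger writeHash, BigInteger readHash)
      throws IOException {
    if (!writeHash.equals(readHash)) {
      throw new IOException(
          "Event hash for writing and reading events did not match.");
    }
  }

  public static void main(String[] args) {
    try {
      // Matching hashes: no exception is thrown.
      verifyEventHash(BigInteger.valueOf(42), BigInteger.valueOf(42));
      // Mismatching hashes: an IOException is thrown.
      verifyEventHash(BigInteger.valueOf(42), BigInteger.valueOf(43));
    } catch (IOException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

   Since index(...) already declares "throws IOException", callers would not
need any new handling for this case.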



> TwoPassDataIndexer fails if features contain \n
> -----------------------------------------------
>
>                 Key: OPENNLP-1166
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1166
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning
>    Affects Versions: 1.8.3
>            Reporter: Peter Thygesen
>            Assignee: Peter Thygesen
>
> Training a model with newline tokens causes TwoPassDataIndexer to throw an
> exception:
> Exception in thread "main" java.util.NoSuchElementException
>     at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
>     at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:71)
>     at opennlp.tools.ml.model.FileEventStream.read(FileEventStream.java:35)
>     at opennlp.tools.ml.model.AbstractDataIndexer.index(AbstractDataIndexer.java:168)
>     at opennlp.tools.ml.model.TwoPassDataIndexer.index(TwoPassDataIndexer.java:72)
>     at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:68)
>     at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:90)
>     at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:244)
>     at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:169)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:256)
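
The failure mode in the trace above can be illustrated with a small,
self-contained sketch. It assumes the temporary event file was line-oriented
text, so a feature containing \n is split across records, and it assumes a
length-prefixed binary encoding (DataOutputStream.writeUTF) as one way to
avoid that. The class and method names are hypothetical, and the on-disk
format actually used by the pull request is not shown in this message.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NewlineRoundTripSketch {

  // Writes one feature per text line and reads the lines back.
  // A feature that itself contains '\n' ends up as several lines.
  public static List<String> viaTextLines(List<String> features) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
      for (String f : features) {
        w.write(f);
        w.write('\n');
      }
    }
    List<String> result = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(buf.toByteArray()), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        result.add(line);
      }
    }
    return result;
  }

  // Writes a count followed by length-prefixed strings; newlines inside
  // a feature are ordinary payload bytes and survive the round trip.
  public static List<String> viaDataStream(List<String> features) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream dos = new DataOutputStream(buf)) {
      dos.writeInt(features.size());
      for (String f : features) {
        dos.writeUTF(f);
      }
    }
    List<String> result = new ArrayList<>();
    try (DataInputStream dis = new DataInputStream(
        new ByteArrayInputStream(buf.toByteArray()))) {
      int n = dis.readInt();
      for (int i = 0; i < n; i++) {
        result.add(dis.readUTF());
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    List<String> features = Arrays.asList("w=\n", "prev=Hello");
    System.out.println(viaTextLines(features).size());           // prints 3
    System.out.println(viaDataStream(features).equals(features)); // prints true
  }
}
```

With the text format, two features come back as three records, which is the
kind of mismatch a line-based reader cannot recover from.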



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
