[
https://issues.apache.org/jira/browse/OPENNLP-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082061#comment-16082061
]
Alessandro Depase commented on OPENNLP-1115:
--------------------------------------------
I don't think this answer is complete, nor (completely) correct, sorry:
# The question was about *why* the lines are dropped, and you are not answering
that. I mean: what is the reason behind dropping them? I tried to read the code
(I stopped when it required too much time without downloading and debugging
it), and this is what I understood: the _AbstractDataIndexer_ throws the
exception in the method _sortAndMerge_ because it "thinks" there isn't enough
data; but it works on the _List eventsToCompare_, which is the result of a
previous computation in the same class, in the method
_index(ObjectStream<Event> events, Map<String, Integer> predicateIndex)_: there
the code builds an int[] from each line in a way I cannot completely understand
(my question, in the end, is: *what is the logic behind the construction of
this array*?). If the array contains at least one element, then ok, we have
elements to compare (and sortAndMerge will not throw this exception); otherwise
the line is dropped. So: *what is the logic behind dropping the line*?
# The answer itself may well be correct, but it does not agree with the
documentation, which only talks about the cutoff value. So, _to complete the
answer_: is there a way to quantify the minimum number of lines, words, or
whatever else is needed? Why do the examples available online work (regardless
of quality; I completely understand that they will not produce a meaningful
result in a real case) with 10 lines, while my example does not?
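To make point 1 concrete, this is a minimal sketch of how the filtering could work, if my partial reading is right. It is my own simplification, not the OpenNLP source: the class CutoffSketch and all of its methods are hypothetical, and it rests on two assumptions, namely that a feature is indexed only if it occurs at least "cutoff" times in the whole training set, and that an event is dropped when none of its features were indexed. The events are the twelve from my log, without the "bow=" prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of cutoff-based indexing. NOT the OpenNLP source.
public class CutoffSketch {

    // The twelve events from the log, without the "bow=" prefix.
    public static List<String[]> sampleEvents() {
        String[] lines = {
            "ok",
            "tutto bene",
            "decisamente non male",
            "fantastica scelta",
            "non pensavo di poter essere cos\u00ec contento",
            "certamente un'ottimo risultato",
            "non va affatto bene",
            "per nulla",
            "niente affatto divertente",
            "va malissimo",
            "va decisamente male",
            "sono molto triste"
        };
        List<String[]> events = new ArrayList<>();
        for (String line : lines) {
            events.add(line.split(" "));
        }
        return events;
    }

    // Assumption: a feature gets an index only if it occurs at least `cutoff` times overall.
    public static Map<String, Integer> buildPredicateIndex(List<String[]> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] features : events) {
            for (String f : features) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        Map<String, Integer> index = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                index.put(e.getKey(), index.size());
            }
        }
        return index;
    }

    // Builds one int[] per event from its indexed features.
    // Assumption: an event whose array ends up empty is dropped.
    public static List<int[]> indexEvents(List<String[]> events,
                                          Map<String, Integer> predicateIndex) {
        List<int[]> kept = new ArrayList<>();
        for (String[] features : events) {
            int[] ids = new int[features.length];
            int n = 0;
            for (String f : features) {
                Integer id = predicateIndex.get(f);
                if (id != null) {
                    ids[n++] = id;
                }
            }
            if (n > 0) {
                kept.add(Arrays.copyOf(ids, n));
            }
        }
        return kept;
    }

    // Number of events that survive indexing for a given cutoff.
    public static int keptCount(List<String[]> events, int cutoff) {
        return indexEvents(events, buildPredicateIndex(events, cutoff)).size();
    }

    public static void main(String[] args) {
        List<String[]> events = sampleEvents();
        for (int cutoff : new int[] {5, 2, 1}) {
            System.out.println("cutoff " + cutoff + ": kept "
                + keptCount(events, cutoff) + " of " + events.size());
        }
    }
}
```

Under these assumptions, no word in my file occurs more than three times ("non" and "va"), so a cutoff of 5 leaves every array empty and all 12 events are dropped, while a cutoff of 1 keeps them all. If that is really what happens, it would explain the output, but I would like it confirmed.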
Thank you in advance for your patience
Alessandro
> Document Categorizer all events dropped
> ---------------------------------------
>
> Key: OPENNLP-1115
> URL: https://issues.apache.org/jira/browse/OPENNLP-1115
> Project: OpenNLP
> Issue Type: Question
> Components: Doccat
> Affects Versions: 1.7.2
> Reporter: Alessandro Depase
> Attachments: Train1.train
>
>
> Hi all,
> I'm trying to perform my first (newbie) document categorization using the
> Italian language.
> I'm using the attached train file and I got this output:
> {{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data
> "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train"
> -encoding UTF-8
> Indexing events using cutoff of 5
> Computing event counts... done. 12 events
> Indexing... Dropped event Ok:[bow=ok]
> Dropped event Ok:[bow=tutto, bow=bene]
> Dropped event Ok:[bow=decisamente, bow=non, bow=male]
> Dropped event Ok:[bow=fantastica, bow=scelta]
> Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere,
> bow=così, bow=contento]
> Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
> Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
> Dropped event no:[bow=per, bow=nulla]
> Dropped event no:[bow=niente, bow=affatto, bow=divertente]
> Dropped event no:[bow=va, bow=malissimo]
> Dropped event no:[bow=va, bow=decisamente, bow=male]
> Dropped event no:[bow=sono, bow=molto, bow=triste]
> done.
> Sorting and merging events...
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to train
> a model.
> To resolve this error use more training data, if this doesn't help there might
> be some fundamental problem with the training data itself.}}
> I already found a couple of other similar issues, just saying that there are
> not enough lines (but I have 6 lines for each category and a cutoff of 5) or
> that without at least 100 lines the categorization quality is not sufficient
> (ok, but that's just a quality matter, it should work, with bad results, but
> it should work). The reason for insufficient data is that all the lines are
> dropped.
> I also tried with the Java API, with the same result.
> But why? What did I miss? I cannot find useful documentation...
> Thank you in advance
> Kind Regards
> Alessandro