Hi Jörn,

unfortunately, I don't have a comprehensive unit test for the parser. But from my perspective, it looks like something more serious is going on than just a bit of parse tree reordering when a determiner ("a", normally tagged DT) is suddenly tagged as a punctuation mark (,):
1.7.2:
(ROOT (S (NP (PRP We)) (VP (VBP need) *!*(NP (NP (DT a)*!* (ADJP (RB very) (VBN complicated)) (NN example) (NN sentence)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))

1.8.0:
(ROOT (S (NP (PRP We)) (VP (VBP need) *!*(, a) (NP (NP*!* (ADJP (RB very) (VBN complicated)) (NN example) (NN sentence)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))

IMHO, punctuation marks and closed word classes like determiners should be pretty stable in their labels. There is usually no need for the parser to invent a tag; the labels seen in the training data should be sufficient.

Could there maybe be a problem with duplicates being dropped silently by the move from the ListHeap to the TreeSet? If duplicate removal is not important, then sorting the heap after it has been filled might be a better option than using a permanently sorted and de-duplicating data structure.

Cheers,

-- Richard

> On 15.05.2017, at 10:39, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Hello Richard,
>
> thanks for reporting this. For 1.8.0 we replaced a Heap with a SortedSet
> [1]. In this commit there is one loop [2] which iterates through the parses
> which will be advanced. The order of the parses in the Heap was not so
> well defined, therefore we decided to sort them by probability.
> We also noticed that this change changes the output of the parser with
> the existing models in our SourceForge model eval test [3].
>
> After running the evaluation on the OntoNotes4 data set I only got a very
> small change and decided it was ok to do this. I am not aware of how big
> the change is, but it was less than the delta of 0.001 in test case [4].
>
> What do you think? Should this be rolled back?
>
> Anyway, that said, about the parser, I still need to understand what
> happened with the lemmatizer.
>
> Jörn
>
> [1] https://github.com/apache/opennlp/commit/3df659b9bfb02084e782f1e8b6ec716f56e0611c
> [2] https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f1e8b6ec716f56e0611c/opennlp-tools/src/main/java/opennlp/tools/parser/AbstractBottomUpParser.java#L285
> [3] https://github.com/apache/opennlp/commit/3df659b9bfb02084e782f1e8b6ec716f56e0611c#diff-a5834f32b8a41b76a336126e4b13d4f7L349
> [4] https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f1e8b6ec716f56e0611c/opennlp-tools/src/test/java/opennlp/tools/eval/OntoNotes4ParserEval.java#L70
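P.S. To make the de-duplication concern concrete, here is a minimal, self-contained Java sketch (not OpenNLP code; the `Candidate` record is a hypothetical stand-in for `opennlp.tools.parser.Parse`). If the `TreeSet`'s comparator orders candidates only by probability, two distinct parses with equal scores compare as equal, and the second `add` is silently ignored. Sorting a list after filling it keeps both:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DedupDemo {
    // Hypothetical stand-in for a parse candidate: distinct trees can share a score.
    record Candidate(String tree, double prob) {}

    public static void main(String[] args) {
        Candidate a = new Candidate("(NP (DT a) (NN test))", 0.42);
        Candidate b = new Candidate("(NP (DT the) (NN test))", 0.42); // same score, different tree

        // A TreeSet ordered only by probability treats equal-score candidates
        // as duplicates: the second add() returns false and b is dropped silently.
        TreeSet<Candidate> set = new TreeSet<>(Comparator.comparingDouble(Candidate::prob));
        set.add(a);
        set.add(b);
        System.out.println("TreeSet size: " + set.size()); // prints 1

        // Sorting a plain list after it has been filled keeps both candidates.
        List<Candidate> list = new ArrayList<>(List.of(a, b));
        list.sort(Comparator.comparingDouble(Candidate::prob));
        System.out.println("List size: " + list.size()); // prints 2
    }
}
```

If that is what happens inside the parser's advance loop, equal-probability parses would be pruned where the old ListHeap kept them, which could explain output changes beyond mere reordering.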