Hi Jörn,

unfortunately, I don't have a comprehensive unit test for the parser. But from my perspective, it looks like something more serious is going on than just a bit of parse tree reordering when a determiner ("a", normally tagged DT) is suddenly tagged as a punctuation mark (,):
1.7.2:
(ROOT (S (NP (PRP We)) (VP (VBP need) *!*(NP (NP (DT a)*!* (ADJP (RB very) (VBN complicated)) (NN example) (NN sentence)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))

1.8.0:
(ROOT (S (NP (PRP We)) (VP (VBP need) *!*(, a) (NP (NP*!* (ADJP (RB very) (VBN complicated)) (NN example) (NN sentence)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ contains) (PP (IN as) (NP (NP (JJ many) (NNS constituents) (CC and) (NNS dependencies)) (PP (IN as) (ADJP (JJ possible)))))))))) (. .)))

IMHO, punctuation marks and closed word classes like determiners should be pretty stable in their labels. There is usually no need for the parser to invent a tag; the labels seen in the training data should be sufficient.

Could there maybe be a problem with duplicates being dropped silently by the move from the ListHeap to the TreeSet? If duplicate removal is not important, then sorting the heap after it has been filled might be a better option than using a permanently sorted and de-duplicating data structure.

Cheers,

-- Richard

> On 15.05.2017, at 10:39, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Hello Richard,
>
> thanks for reporting this. For 1.8.0 we replaced a Heap with a SortedSet
> [1]. In this commit there is one loop [2] which iterates through the parses
> which will be advanced. The order of the parses in the Heap was not so
> well defined, therefore we decided to sort them by probability.
> We also noticed that this change changes the output of the parser with
> the existing models in our SourceForge model eval test [3].
>
> After running the evaluation on the OntoNotes4 data set I only got a very
> small change and decided it was ok to do this. I am not aware of how big
> the change is, but it was less than the delta of 0.001 in test case [4].
>
> What do you think? Should this be rolled back?
>
> Anyway, that said, about the parser, I still need to understand what
> happened with the lemmatizer.
>
> Jörn
>
> [1] https://github.com/apache/opennlp/commit/3df659b9bfb02084e782f1e8b6ec716f56e0611c
> [2] https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f1e8b6ec716f56e0611c/opennlp-tools/src/main/java/opennlp/tools/parser/AbstractBottomUpParser.java#L285
> [3] https://github.com/apache/opennlp/commit/3df659b9bfb02084e782f1e8b6ec716f56e0611c#diff-a5834f32b8a41b76a336126e4b13d4f7L349
> [4] https://github.com/apache/opennlp/blob/3df659b9bfb02084e782f1e8b6ec716f56e0611c/opennlp-tools/src/test/java/opennlp/tools/eval/OntoNotes4ParserEval.java#L70
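P.S. To make the de-duplication concern concrete, here is a minimal, self-contained Java sketch (not OpenNLP code; the `Candidate` record is a hypothetical stand-in for `opennlp.tools.parser.Parse`). If the `TreeSet`'s comparator orders candidates only by probability, two distinct parses with equal scores compare as equal, and the second `add` is silently ignored. Sorting a list after filling it keeps both:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DedupDemo {
    // Hypothetical stand-in for a parse candidate: distinct trees can share a score.
    record Candidate(String tree, double prob) {}

    public static void main(String[] args) {
        Candidate a = new Candidate("(NP (DT a) (NN test))", 0.42);
        Candidate b = new Candidate("(NP (DT the) (NN test))", 0.42); // same score, different tree

        // A TreeSet ordered only by probability treats equal-score candidates
        // as duplicates: the second add() returns false and b is dropped silently.
        TreeSet<Candidate> set = new TreeSet<>(Comparator.comparingDouble(Candidate::prob));
        set.add(a);
        set.add(b);
        System.out.println("TreeSet size: " + set.size()); // prints 1

        // Sorting a plain list after it has been filled keeps both candidates.
        List<Candidate> list = new ArrayList<>(List.of(a, b));
        list.sort(Comparator.comparingDouble(Candidate::prob));
        System.out.println("List size: " + list.size()); // prints 2
    }
}
```

If that is what happens inside the parser's advance loop, equal-probability parses would be pruned where the old ListHeap kept them, which could explain output changes beyond mere reordering.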