Okay, I'll commit the ClearPOSTagger and make it available in the ctakes-pos-tagger component, but leave everything else as it currently is (the default pipeline stays on OpenNLP). We can always switch from one to the other in the future, once there is a fair comparison/benchmark.
Note: I think there is a pretty significant speed improvement in the ClearPOSTagger as well.

> -----Original Message-----
> From: Lee Becker [mailto:[email protected]]
> Sent: Monday, April 08, 2013 2:29 PM
> To: [email protected]
> Subject: Re: ClearNLP POSTagger
>
> On Mon, Apr 8, 2013 at 12:04 PM, Steven Bethard <[email protected]> wrote:
>
> > > While working on the Dependency Parser/SRL labeler, we also have a
> > > POSTagger from ClearNLP. It is fairly simple and I have the code
> > > ready (also trained on the same data as the dep parser- MiPaq/SHARP)
> > > to be checked-in. What does the folks think:
> > > We can include both Analysis Engines in the ctakes-pos-tagger project.
> > > But should we leave the current OpenNLP in the default pipeline or
> > > default to the latest?
> >
> > My vote would be to default for whatever has the best performance.
> > Presumably the ClearNLP one?
> >
> > > "The ClearNLP POS tagger shows more robust results on unknown words
> > > by generalizing lexical features.
> >
> > Looking at the paper, ClearNLP POS tagger is not compared directly to
> > the cTAKES OpenNLP POS tagger, but they do outperform the Stanford
> > tagger trained on the same data, so it's probably a reasonable guess
> > that they're more accurate than the OpenNLP tagger.
> >
> > > It also uses AdaGrad for machine learning, which is a more advanced
> > > learning algorithm than maximum entropy used by OpenNLP."
> >
> > My opinion is that we should never include a model in cTAKES just
> > because it has a "more advanced learning algorithm". "More advanced
> > learning algorithm" does not always translate into better performance.
>
> If my memory is serving me correctly, I think Jinho trained his parsers off of
> predicted POS tags to get eke out the extra performance. The takeaway
> being that ClearNLP does better when you can use as much of its pipeline as
> possible.
