Michael,

The inability to redistribute training data is a current problem with retraining and improving models:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Also, see this discussion about "OpenNLP Annotations Proposal" on the opennlp-dev list:

http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201106.mbox/thread

It might take a little while to get this going, but we're all very keen to make progress on it!

Jason

On Fri, Jun 10, 2011 at 12:27 PM, Michael Schmitz <sch...@cs.washington.edu> wrote:

> Hi, I was wondering if the training data for the OpenNLP maxent POS tagger
> models is public and available somewhere. I would like to train models for
> the pos tagger and the chunker that work on sentences without case (i.e.
> all capitalized). If I had the training data used for en-pos-maxent.bin, a
> first pass would simply mean capitalizing the tokens and running the
> trainer. It appears that the chunker training data comes from CONLL2000
> (http://www.cnts.ua.ac.be/conll2000/chunking/).
>
> I would be happy to share the models with OpenNLP if anyone thought they
> would be of use to others.
>
> Peace.
> Michael

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge