Hi Michael,

Maybe you could use the CONLL2000 data. What do you think? It includes POS tags. To use it, you will need to create a new converter:
1. Create a new POSSample stream for CONLL2000. It is similar to ConllXPOSSampleStream <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStream.java?view=markup>.
2. Create a factory for your new class, similar to ConllXPOSSampleStreamFactory <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStreamFactory.java?view=markup>. This class is required to launch the formatter from the command line.
3. Finally, add the factory to POSTaggerConverter <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/postag/POSTaggerConverter.java?view=markup>.

With the converted sample you will be able to train your model as explained in the documentation.

It would be nice if you could contribute back with a patch adding your new converter.

Regards,
William

On Fri, Jun 10, 2011 at 11:21 PM, Jason Baldridge <jasonbaldri...@gmail.com> wrote:

> Michael,
>
> The inability to redistribute training data is a current problem with
> retraining and improving models:
>
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> Also, see this discussion about "OpenNLP Annotations Proposal" on the
> opennlp-dev list:
>
> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201106.mbox/thread
>
> It might take a little while to get this going, but we're all very keen to
> make progress on it!
>
> Jason
>
> On Fri, Jun 10, 2011 at 12:27 PM, Michael Schmitz
> <sch...@cs.washington.edu> wrote:
>
> > Hi, I was wondering if the training data for the OpenNLP maxent POS tagger
> > models is public and available somewhere. I would like to train models for
> > the pos tagger and the chunker that work on sentences without case (i.e.
> > all capitalized). If I had the training data used for en-pos-maxent.bin, a
> > first pass would simply mean capitalizing the tokens and running the
> > trainer. It appears that the chunker training data comes from CONLL2000
> > (http://www.cnts.ua.ac.be/conll2000/chunking/).
> >
> > I would be happy to share the models with OpenNLP if anyone thought they
> > would be of use to others.
> >
> > Peace. Michael
>
> --
> Jason Baldridge
> Assistant Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge
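
P.S. To make step 1 of the reply a bit more concrete, here is a rough, standalone sketch of the conversion itself. It does not use any OpenNLP classes; the class name Conll2000ToPosSketch and the upperCase flag are made up for illustration. It assumes each CONLL2000 line has the layout "word POS chunk" with a blank line between sentences, and that the POS trainer expects one sentence per line with tokens written as word_TAG. The upperCase flag mirrors the all-capitals idea from the original question. A real converter would wrap this logic in the stream and factory classes named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of OpenNLP: turns CONLL2000-style lines
// into "word_TAG word_TAG ..." sentences, one sentence per output line.
public class Conll2000ToPosSketch {

    public static List<String> convert(List<String> conllLines, boolean upperCase) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (String line : conllLines) {
            String trimmed = line.trim();

            // Assumption: a blank line marks the end of a sentence.
            if (trimmed.isEmpty()) {
                if (current.length() > 0) {
                    sentences.add(current.toString());
                    current.setLength(0);
                }
                continue;
            }

            // Assumption: fields are "word POS chunk"; the chunk tag is ignored.
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                throw new IllegalArgumentException(
                        "Expected at least a word and a POS tag: " + line);
            }

            String word = upperCase ? fields[0].toUpperCase(Locale.ENGLISH) : fields[0];
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word).append('_').append(fields[1]);
        }

        // Flush the last sentence if the input did not end with a blank line.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "He PRP B-NP", "reckons VBZ B-VP", "", "Yes UH O");
        for (String sentence : convert(sample, true)) {
            System.out.println(sentence);
        }
    }
}
```

Passing false for upperCase keeps the original casing, so the same sketch could produce both the regular and the caseless training files.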