Hi Michael,

Maybe you could use the CONLL2000 data. What do you think? It includes POS
tags.
To use it you will need to create a new converter:

   1. Create a new POSSample stream for CONLL2000. It is similar to
      ConllXPOSSampleStream
      <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStream.java?view=markup>.
   2. Create a factory for your new class, similar to
      ConllXPOSSampleStreamFactory
      <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStreamFactory.java?view=markup>.
      This class is required to launch the converter from the command line.
   3. Finally, add the factory to POSTaggerConverter
      <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/postag/POSTaggerConverter.java?view=markup>.
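
To give an idea of what the stream in step 1 has to do, here is a rough,
standalone sketch. It does not use any OpenNLP classes, and the class name
Conll2000ToPosSketch and the upperCase flag are made up for illustration. It
assumes the CONLL2000 layout of one token per line ("word POS chunk") with a
blank line between sentences, and writes one sentence per line as word_TAG
pairs, which is my understanding of what the POS tagger trainer expects. The
real converter would wrap the same logic in a POSSample stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of OpenNLP.
public class Conll2000ToPosSketch {

    // Turns CONLL2000 lines into one "word_TAG word_TAG ..." string per sentence.
    // The chunk column is ignored. If upperCase is true, tokens are upper-cased
    // (tags are left untouched), which matches the all-caps use case in this thread.
    static List<String> convert(List<String> lines, boolean upperCase) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current sentence.
                flush(current, sentences);
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                throw new IllegalArgumentException(
                        "Expected at least word and POS columns: " + line);
            }
            String word = upperCase ? cols[0].toUpperCase(Locale.ENGLISH) : cols[0];
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word).append('_').append(cols[1]);
        }
        // The last sentence may not be followed by a blank line.
        flush(current, sentences);
        return sentences;
    }

    private static void flush(StringBuilder current, List<String> sentences) {
        if (current.length() > 0) {
            sentences.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "He PRP B-NP",
                "reckons VBZ B-VP",
                "",
                "Yes UH O");
        for (String sentence : convert(sample, true)) {
            System.out.println(sentence);
        }
    }
}
```

Upper-casing inside the converter is only one option; the tokens could just as
well be upper-cased in a separate pass over the converted file.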


With the converted sample you will be able to train your model as
explained in the documentation.
It would be nice if you could contribute back with a patch adding your new
converter.

Regards,
William


On Fri, Jun 10, 2011 at 11:21 PM, Jason Baldridge
<jasonbaldri...@gmail.com> wrote:

> Michael,
>
> The inability to redistribute training data is a current problem with
> retraining and improving models:
>
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> Also, see this discussion about "OpenNLP Annotations Proposal" on the
> opennlp-dev list:
>
>
> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201106.mbox/thread
>
> It might take a little while to get this going, but we're all very keen to
> make progress on it!
>
> Jason
>
> On Fri, Jun 10, 2011 at 12:27 PM, Michael Schmitz
> <sch...@cs.washington.edu> wrote:
>
> > Hi, I was wondering if the training data for the OpenNLP maxent POS
> > tagger models is public and available somewhere.  I would like to train
> > models for the pos tagger and the chunker that work on sentences without
> > case (i.e. all capitalized).  If I had the training data used for
> > en-pos-maxent.bin, a first pass would simply mean capitalizing the tokens
> > and running the trainer.  It appears that the chunker training data comes
> > from CONLL2000 (http://www.cnts.ua.ac.be/conll2000/chunking/).
> >
> > I would be happy to share the models with OpenNLP if anyone thought they
> > would be of use to others.
> >
> > Peace.  Michael
> >
>
>
>
> --
> Jason Baldridge
> Assistant Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge
>
