2011/6/10 Jason Baldridge <[email protected]>:
> This looks great! I don't have time to look at this in great detail right
> now, but am happy to give feedback on particular issues and questions.
>
> Active learning would be nice to add eventually, but it has to be done with
> great care, e.g. using uncertainty alone doesn't really work that well and
> care needs to be taken with class imbalance etc. Random sampling is a good
> starting point, and can be used while ironing out the details.

Acknowledged. I wasn't planning to implement this part myself anyway.

> I can't remember if this has been discussed before, but does there need to
> be a non-OpenNLP group which has a primary purpose of creating open
> standardized datasets and annotation interfaces, etc?
>
> It seems also we might be able to get some corporate sponsorship for
> annotation, improvements to models, creation of resources for specific
> languages, etc.

No idea. I think Jacob Perkins (and possibly others), who works with NLTK,
was also interested in such open corpora. See for instance this thread on
metaoptimize.com/qa:

http://metaoptimize.com/qa/questions/4650/what-licenses-cover-a-nltk-tagger-trained-on-treebank

> BTW, there is a lot that can be done to bootstrap POS-taggers from raw data
> and the tags in Wiktionary, so if folks are interested in that I'm happy to
> provide pointers.

As mentioned by Tommaso, I think we should start to structure the wiki for
this effort. Do you want me to create sub-pages of [1] for POS-tagging and NE
detection? I could write the NE detection page with a description of the
current effort on corpus-refiner / Walter and let you add pointers for the
POS-tagging case.

[1] https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
