I agree that free annotations on free corpora are the way forward. There are some ways to get this going beyond waiting for volunteers:
- People like me apply for grant funds (e.g. NSF) to set up infrastructure and initial annotations in two languages. I'm currently fully loaded on the cycles I have for being a principal investigator on grants, but I would support an application by others, or I could wait until fall or so. (A big problem with this is the waiting time between applying for and receiving grant funds, plus extra cycles if one gets turned down.)
- Companies that are using OpenNLP could contribute to a fund that would pay developers to build an annotation infrastructure and annotators to label things. This could be done somewhat in the Google Summer of Code mode, with students getting experience and some pay while working under the supervision of experienced developers.
- Individual companies fund one-off small projects to develop particular corpora or annotation capabilities.

The key here for companies is that the cost of creating the resources would be much less than developing them in house, and they have the potential to realize benefits from others contributing in like manner, or even just from spurring the volunteer community by getting the ball rolling.

Oh, and definite +1 to getting things released and compatible with old models.

Jason

On Tue, Feb 1, 2011 at 4:05 PM, Jörn Kottmann <[email protected]> wrote:

> On 2/1/11 10:45 PM, Grant Ingersoll wrote:
>
>> Yes, we should start assembling a list of corpora, if only so that we at
>> least have it for others who come later and want to reproduce them. In the
>> meantime, I would agree that we can just keep the models elsewhere. We
>> don't have to provide models. They are a convenience for all involved, but
>> not a requirement in order to run. I wonder how many people actually train
>> their own. (BTW, we should update our website to point to older models,
>> too. They are really hard to find unless you do some URL rewriting.)
>>
>
> OK, then let's get the release out as quickly as possible without depending
> on the legal issues for the models.
> And let's do as much as possible to resolve these issues alongside the
> release work. I might have a
> few spare cycles here and there to work on that.
>
> To get started with the legal stuff we need to compile a list with all the
> necessary information;
> that list will make a nice corpora page in our wiki.
> Our documentation already contains instructions on how to train on some
> freely available data.
>
> In the end I believe we are all best served with a Wikinews corpus which
> can be labeled by our community.
>
> Jörn
>

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
