+1 This is awesome. Here is a tool that could be relevant for getting the ball rolling on some datasets:
http://code.google.com/p/dualist/

Jason

On Tue, Jun 7, 2011 at 12:58 PM, Chris Collins <[email protected]> wrote:

> Thanks Jörn, I agree with your assessment. This is exactly where I am at
> the moment, and I am sure many others are too. You hit the nail on the
> head: currently people have to start from scratch, and that's daunting.
> For the phase when you start crowdsourcing, I am wondering what this
> web-based UI would look like. I am assuming that, with some basic
> instructions, tasks like:
>
> - sentence boundary markup
> - name identification (people, money, dates, locations, products)
>
> are narrowly focused and crowdsource-able with somewhat trivial UIs
> ("For the following sentences, highlight names of people, such as
> 'steve jobs' or 'prince william'").
>
> When it comes to POS tagging (which is my current challenge), you can
> approach it like the above ("For the following sentences, select all the
> nouns"), reassemble all the observations, and perhaps use something like
> triple judgements to look for disagreement. Or you could have an editor
> that lets a user mark up the whole sentence (perhaps with the parts we
> are already guessing filled in from a pre-learnt model). I am not sure
> the triple judgement is necessary; maybe sentences labeled by a collage
> of people would still converge well in training.
>
> Both can be assisted by some prior trained model to help keep people
> awake and on track :-} I think you mentioned in a prior mail that you
> can even use the models that were built with proprietary data to
> bootstrap the assistance process.
>
> These are two ends of the spectrum: one assumes you are using people
> with limited language skills, the other potentially much more competent
> people. With one you need to gather data from probably many more people,
> with the other from far fewer. Personally I like the crowdsourced
> approach, but I wonder if OpenNLP could find enough language "experts"
> per language that it would make better sense to build a non-web-based
> app that is perhaps a little more expedient to operate.
>
> For giggles, assuming we needed to generate labels:
>
> - 60k lines of text
> - average line length == 11 words
> - number of judgements == 3
>
> We would be collecting almost 2M judgements from people, which we would
> reassemble into our training data after throwing out the bath water.
>
> Maybe in the competent-language-expert case we only get each sentence
> judged once, by one person. There is then perhaps no labeled sentence to
> be reassembled, but we may want to keep people's judgements separate so
> we could validate their work against others'.
>
> The data processing pipeline looks somewhat different in each case. The
> competent POS labeler case simplifies the pipeline greatly.
>
> I would love to help in whatever way I can, and I can also find people
> to help label data at my company's own expense to help accelerate this.
>
> Best
>
> C
>
> On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:
>
> > Hi all,
> >
> > based on some discussion we had in the past I put together
> > a short proposal for a community based labeling project.
> >
> > Here is the link:
> > https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
> >
> > Any comments and opinions are very welcome.
> >
> > Thanks,
> > Jörn

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
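
A minimal sketch of the "assisted by some prior trained model" idea discussed above: pre-tag each sentence with an already-trained OpenNLP POS model so annotators only confirm or correct the guesses rather than labeling from scratch. This assumes OpenNLP's 1.5-style API and the stock en-pos-maxent.bin model file; the example sentence and class name are placeholders, not part of any agreed design.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class PreTagSketch {
    public static void main(String[] args) throws Exception {
        // Load an existing model (possibly one trained on proprietary data,
        // as mentioned above) to bootstrap the assistance process.
        InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
        POSModel model = new POSModel(modelIn);
        POSTaggerME tagger = new POSTaggerME(model);

        // Pre-tag one sentence; the UI would show these tags as defaults
        // for the annotator to confirm or correct.
        String[] tokens = WhitespaceTokenizer.INSTANCE
                .tokenize("Steve Jobs announced the deal in London yesterday .");
        String[] tags = tagger.tag(tokens);

        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t" + tags[i]);
        }
        modelIn.close();
    }
}

The same pre-annotation idea would presumably carry over to the sentence-boundary and name-highlighting tasks via SentenceDetectorME and NameFinderME.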

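As a back-of-the-envelope check on the numbers in the quoted mail: 60,000 lines x ~11 words per line x 3 judgements per word is about 1.98M, so "almost 2M judgements" holds. The sketch below shows one way those triple judgements could be reassembled: keep a token's tag only where at least two of the three annotators agree, and flag the rest for review ("throwing out the bath water"). The data layout, class name, and 2-of-3 threshold are assumptions for illustration, not an agreed pipeline.

import java.util.HashMap;
import java.util.Map;

public class JudgementMergeSketch {

    // judgements[annotator][token] holds the POS tag one annotator gave
    // to one token of a single sentence.
    static String[] merge(String[][] judgements) {
        int tokenCount = judgements[0].length;
        String[] merged = new String[tokenCount];
        for (int t = 0; t < tokenCount; t++) {
            // Count votes for each tag proposed for this token.
            Map<String, Integer> votes = new HashMap<String, Integer>();
            for (String[] annotator : judgements) {
                Integer n = votes.get(annotator[t]);
                votes.put(annotator[t], n == null ? 1 : n + 1);
            }
            // Pick the majority tag; with 3 judgements we require at
            // least 2 votes, otherwise the token needs review.
            String best = null;
            int bestVotes = 0;
            for (Map.Entry<String, Integer> e : votes.entrySet()) {
                if (e.getValue() > bestVotes) {
                    best = e.getKey();
                    bestVotes = e.getValue();
                }
            }
            merged[t] = bestVotes >= 2 ? best : null;
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three annotators label the same three tokens; they disagree
        // completely on the second one.
        String[][] judgements = {
            { "NNP", "NN",  "VBD" },
            { "NNP", "JJ",  "VBD" },
            { "NNP", "VBG", "VBD" }
        };
        String[] merged = merge(judgements);
        for (int t = 0; t < merged.length; t++) {
            System.out.println("token " + t + ": "
                    + (merged[t] == null ? "no agreement, send back for review" : merged[t]));
        }
    }
}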