I will definitely have a look at this tool, thanks for pointing it out.
For the labeling part, I believe we should do both crowd sourcing and work with linguistic experts, even if we use the experts only to label some test data, which we need in order to measure how well the crowd sourcing approach works. Let's try to extend the proposal a little so we have some sort of plan that could get us started.

Jörn

On 6/8/11 6:36 PM, Jason Baldridge wrote:
+1 This is awesome. Here is a tool that could be relevant in getting the ball rolling on some datasets: http://code.google.com/p/dualist/

Jason

On Tue, Jun 7, 2011 at 12:58 PM, Chris Collins <[email protected]> wrote:

Thanks Jörn, I agree with your assessment. This is exactly where I am at the moment, and I am sure many others are too. You hit the nail on the head: currently people have to start from scratch, and that's daunting.

For the phase when you start crowd sourcing, I am wondering what this web-based UI would look like. I am assuming that, with some basic instructions, things like:

- sentence boundary markup
- name identification (people, money, dates, locations, products)

are narrowly focused, crowd-sourceable tasks with a somewhat trivial UI ("For the following sentences, highlight names of people, such as 'steve jobs' or 'prince william'").

When it comes to POS tagging (which is my current challenge), you can approach it like the above ("For the following sentences, select all the nouns"), then re-assemble all the observations, perhaps using something like triple judgments to look for disagreement. Or you could have an editor that lets a user mark up the whole sentence (perhaps with the parts we can already guess filled in from a pre-learnt model). I am not sure the triple judgment is necessary; maybe sentences labeled by a collage of people would still converge well in training. Both approaches can be assisted by a previously trained model, to help keep people awake and on track :-} I think you mentioned in a prior mail that we can even use models built with proprietary data to bootstrap the assistance process.

These are two ends of the spectrum: one assumes you are using people with limited language skills, the other potentially much more competent ones. With one you need to gather data from many more people; with the other, far fewer. Personally I like the crowd-sourced approach, but I wonder if OpenNLP could find enough language "experts" per language that it makes better sense to build a non-web-based app that is a little more expedient to operate.

For giggles, assume we needed to generate labels for 60k lines of text, with an average of 11 words per line and 3 judgments per word. We would be collecting almost 2M judgments from people, which we would reassemble into our training data after throwing out the bath water. In the competent-language-expert case, maybe each sentence is judged only once, by one person. There is then no labeled sentence to be re-assembled, but we may want to keep people's judgments separate so we can validate their work against others'. The data processing pipeline looks somewhat different in each case; the competent POS-labeler case simplifies it greatly.

I would love to help in whatever way I can, and I can also find people to help label data at my company's own expense to help accelerate this.

Best

C

On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:

Hi all,

based on some discussion we had in the past I put together a short proposal for a community-based labeling project. Here is the link: https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Any comments and opinions are very welcome.

Thanks,
Jörn
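To make the triple-judgment reassembly Chris describes above a bit more concrete, here is a minimal sketch of majority-vote aggregation: each token gets three independent crowd judgments, 2-of-3 agreement yields a label, and full disagreement flags the token for expert review. This is an illustrative assumption, not part of the proposal; the class and method names are hypothetical.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: majority-vote aggregation of crowd POS judgments.
 * Tokens where all three annotators disagree are routed to an expert.
 */
public class JudgmentAggregator {

    /** Returns the majority tag, or null if no tag wins at least 2 of 3 votes. */
    static String aggregate(List<String> judgments) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String tag : judgments) {
            Integer c = counts.get(tag);
            counts.put(tag, c == null ? 1 : c + 1);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        // Require at least 2 of 3 judgments to agree before accepting a label.
        return bestCount >= 2 ? best : null;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList("NN", "NN", "VB"))); // NN
        System.out.println(aggregate(Arrays.asList("NN", "VB", "JJ"))); // null -> expert review
    }
}

The same reassembly step is also where the 60k-line estimate above comes from: 60,000 lines x 11 words x 3 judgments is roughly 2M individual judgments to aggregate.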

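As a sketch of the model-assisted pre-labeling both Chris and Jörn mention, something like the following could pre-fill POS tags for annotators to correct rather than enter from scratch. It uses the OpenNLP 1.5-era POS tagger API; the model file name is an assumption, and any previously trained model (including one bootstrapped from proprietary data, as discussed above) would work.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PreTagDemo {
    public static void main(String[] args) throws Exception {
        // Model path is an assumption; substitute any trained POS model.
        InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
        POSModel model = new POSModel(modelIn);
        modelIn.close();

        POSTaggerME tagger = new POSTaggerME(model);
        String[] tokens = {"Steve", "Jobs", "founded", "Apple", "."};
        String[] tags = tagger.tag(tokens);

        // Show the model's guesses; the annotator only corrects mistakes,
        // which keeps people awake and on track, per the thread above.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t" + tags[i]);
        }
    }
}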