Thanks, Jörn, I agree with your assessment.  This is exactly where I am at the 
moment, and I am sure many others are too.  You hit the nail on the head: currently 
people have to start from scratch, and that's daunting.  For the phase when you 
start crowd-sourcing, I am wondering what this web-based UI would look like.  I 
am assuming that, given some basic instructions, tasks like:

- sentence boundary markup
- name identification (people, money, dates, locations, products)

These are narrowly focused, crowd-sourceable tasks with somewhat trivial UIs 
("For the following sentences, highlight names of people, such as 'Steve Jobs' 
or 'Prince William'").

When it comes to POS tagging (which is my current challenge), you can approach 
it like the above ("For the following sentences, select all the nouns") and 
re-assemble all the observations, perhaps using something like triple 
judgements to look for disagreement.  Or you could have an editor that lets a 
user mark up the whole sentence (perhaps with the parts we are already guessing 
filled in from a pre-learnt model).  I am not sure the triple judgement is 
necessary; maybe sentences labeled by a collection of people would still 
converge well in training.
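To make the triple-judgement idea concrete, here is a minimal sketch (not OpenNLP code; the function name, tagset, and example labels are all made up for illustration) of merging per-token POS labels from several annotators by majority vote and flagging the tokens where they disagree:

```python
from collections import Counter

def merge_judgements(judgements):
    """Merge per-token POS labels from several annotators by majority vote.

    `judgements` is a list of label sequences, one per annotator, all for
    the same sentence.  Returns the merged labels plus the token indices
    where annotators disagreed (candidates for review).
    """
    merged, disputed = [], []
    for i, labels in enumerate(zip(*judgements)):
        winner, count = Counter(labels).most_common(1)[0]
        merged.append(winner)
        if count < len(labels):
            disputed.append(i)
    return merged, disputed

# Three hypothetical annotators labeling "Time flies like an arrow"
a = ["NN", "VBZ", "IN", "DT", "NN"]
b = ["NN", "NNS", "IN", "DT", "NN"]
c = ["NN", "VBZ", "IN", "DT", "NN"]
labels, review = merge_judgements([a, b, c])
# labels == ["NN", "VBZ", "IN", "DT", "NN"]; review == [1]
```

With this kind of merge, only the disputed tokens (index 1 above) would need an extra round of judgement or an editor's attention.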

Both approaches can be assisted by some previously trained model to help keep 
people awake and on track :-}  I think you mentioned in a prior mail that you 
can even use the models that were built with proprietary data to bootstrap the 
assistance process.

These are two ends of the spectrum: one assumes you are using people with 
limited language skills, the other people who are potentially much more 
competent.  With one you probably need to gather data from many more people; 
with the other, far fewer.  Personally I like the crowd-sourced approach, but I 
wonder if OpenNLP could find enough language "experts" per language that it 
would make better sense to build a non-web-based app that is perhaps a little 
more expedient to operate.

For giggles, assuming we needed to generate labels:
60k lines of text
average words per line == 11
number of judgements == 3

We would be collecting almost 2M judgements from people, which we would 
reassemble into our training data after throwing out the bath water.
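The back-of-envelope arithmetic, assuming one judgement means one label on one token:

```python
lines = 60_000
words_per_line = 11      # average
judgements_per_word = 3  # triple judgement

total = lines * words_per_line * judgements_per_word
print(total)  # 1980000 -- just under 2M
```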

Maybe in the competent-language-expert case we only get each sentence judged 
once, by one person.  There is then perhaps no labeled sentence to be 
re-assembled, but we may want to keep people's judgements separate so we can 
validate their work against others'.
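One cheap way to do that validation (again just a sketch; the function and the calibration-sentence idea are my assumptions, not anything existing in OpenNLP) is to have every expert label a small shared set of sentences and compute a per-token agreement rate against the pooled labels of the others:

```python
def agreement_rate(annotator_labels, reference_labels):
    """Fraction of tokens where one annotator's labels match a reference
    (e.g. the majority vote of the other annotators on shared sentences)."""
    matches = sum(a == r for a, r in zip(annotator_labels, reference_labels))
    return matches / len(reference_labels)

# Hypothetical spot-check on one shared calibration sentence
mine = ["NN", "VBZ", "IN", "DT", "NN"]
ref  = ["NN", "NNS", "IN", "DT", "NN"]
rate = agreement_rate(mine, ref)  # 0.8
```

Annotators whose rate drifts well below the group's could have their sentences re-queued for a second judgement.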

The data processing pipeline looks somewhat different in each case; the 
competent-POS-labeler case simplifies it greatly.

I would love to help in whatever way I can, and I can also find people to help 
label data at my company's own expense to accelerate this.

Best

C

On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:

> Hi all,
> 
> based on some discussion we had in the past I put together
> a short proposal for a community based labeling project.
> 
> Here is the link:
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
> 
> Any comments and opinions are very welcome.
> 
> Thanks,
> Jörn
