Thanks Jörn, I agree with your assessment. This is exactly where I am at the
moment, and I am sure many others are too. You hit the nail on the head:
currently people have to start from scratch, and that's daunting. For the phase
when you start crowdsourcing, I am wondering what this web-based UI would look
like. I am assuming that with some basic instructions, things like:
- sentence boundary markup
- named-entity identification (people, money, dates, locations, products)
These are narrowly focused, crowd-sourceable tasks with a somewhat trivial UI
("For the following sentences, highlight the names of people, such as
"Steve Jobs" or "Prince William").
When it comes to POS tagging (which is my current challenge), you can approach
it like the above ("For the following sentences, select all the nouns"), then
reassemble all the observations and perhaps use something like triple
judgments to look for disagreement. Alternatively, you could have an editor
that lets a user mark up the whole sentence (perhaps with the parts we are
already guessing filled in from a pre-trained model). I am not sure the triple
judgment is necessary; maybe sentences labeled by a collage of people would
still converge well in training.
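Just to make the triple-judgment idea concrete, here is a rough sketch of how
the per-token votes could be merged and disagreements flagged for review. The
function name, label set, and data layout are all my own assumptions, not
anything from the proposal:

```python
from collections import Counter

def aggregate_judgments(judgments):
    """Majority-vote over per-token labels from several annotators.

    judgments: one label sequence per annotator for the same sentence,
    e.g. three annotators' POS tags for a two-token sentence.
    Returns the winning label per token, plus a flag marking tokens
    where the annotators disagreed (candidates for expert review).
    """
    merged, disputed = [], []
    for token_labels in zip(*judgments):
        label, count = Counter(token_labels).most_common(1)[0]
        merged.append(label)
        disputed.append(count < len(token_labels))
    return merged, disputed

# Three annotators tag the sentence "light fades"; one disagrees on token 0.
labels, flags = aggregate_judgments(
    [["NN", "VBZ"], ["NN", "VBZ"], ["JJ", "VBZ"]]
)
# labels -> ["NN", "VBZ"]; flags -> [True, False]
```

Disputed tokens could then be routed to a fourth judgment or to one of the
more competent labelers, rather than re-judging every sentence.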
Both approaches can be assisted by a previously trained model to help keep
people awake and on track :-} I think you mentioned in a prior mail that you
can even use models built with proprietary data to bootstrap the assistance
process.
These are two ends of a spectrum: one assumes you are using people with
limited language skills, the other people who are potentially much more
competent. With one you need to gather data from many more people; with the
other, from far fewer. Personally I like the crowd-sourced approach, but I
wonder if OpenNLP could find enough language "experts" per language that it
would make better sense to build a non-web-based app that is a little more
expedient to operate.
For giggles, assuming we needed to generate labels for:
- 60k lines of text
- average line length == 11 words
- number of judgments == 3
we would be collecting almost 2M judgments from people, which we would
reassemble into our training data after throwing out the bath water.
Maybe in the competent-language-expert case each sentence is only judged once,
by one person. There is then no labeled sentence to reassemble, but we may
want to keep people's judgments separate so we can validate their work
against others'.
The data-processing pipeline looks somewhat different in each case; the
competent-POS-labeler case simplifies it greatly.
I would love to help in whatever way I can, and can also find people to help
label data at my company's own expense to help accelerate this.
Best
C
On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:
> Hi all,
>
> based on some discussion we had in the past I put together
> a short proposal for a community based labeling project.
>
> Here is the link:
> https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
>
> Any comments and opinions are very welcome.
>
> Thanks,
> Jörn