I will definitely have a look at this tool, thanks for pointing
it out.

For the labeling part I believe we should do both crowd sourcing
and expert annotation, even if we use the linguistic experts only to
label some test data, which we need in order to measure how well the
crowd sourcing approach works.
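
To make that measurement concrete, token-level accuracy of the aggregated
crowd labels against the expert gold set would do for a first pass. A
minimal sketch in Java (all names hypothetical):

import java.util.List;

// Minimal sketch (names hypothetical): score aggregated crowd labels
// against a small expert-labeled gold set by token-level accuracy.
public class CrowdEval {

    static double accuracy(List<String> goldTags, List<String> crowdTags) {
        if (goldTags.isEmpty() || goldTags.size() != crowdTags.size()) {
            throw new IllegalArgumentException("Need two aligned, non-empty tag sequences");
        }
        int correct = 0;
        for (int i = 0; i < goldTags.size(); i++) {
            if (goldTags.get(i).equals(crowdTags.get(i))) {
                correct++;
            }
        }
        return (double) correct / goldTags.size();
    }

    public static void main(String[] args) {
        double acc = accuracy(
                List.of("NN", "VB", "DT", "NN"),   // expert gold tags
                List.of("NN", "VB", "DT", "JJ"));  // aggregated crowd tags
        System.out.printf("Crowd accuracy vs. expert gold: %.2f%n", acc); // 0.75
    }
}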

Let's try to extend the proposal a little so we have some sort of a plan
which could get us started.

Jörn

On 6/8/11 6:36 PM, Jason Baldridge wrote:
+1 This is awesome.

Here is a tool that could be relevant in getting the ball rolling on some
datasets:

http://code.google.com/p/dualist/

Jason

On Tue, Jun 7, 2011 at 12:58 PM, Chris Collins <[email protected]> wrote:

Thanks Jörn, I agree with your assessment.  This is exactly where I am at
the moment, and I am sure many others are too.  You hit the nail on the head:
currently people have to start from scratch, and that's daunting.  For the
phase when you start crowd sourcing I am wondering what this web based UI
would look like.  I am assuming that, with some basic instructions, things
like:

- sentence boundary markup
- name identification (people, money, dates, locations, products)

These are narrowly focused, crowd-sourceable tasks with a somewhat trivial UI
(e.g. "For the following sentences, highlight names of people, such as 'Steve
Jobs' or 'Prince William'").

When it comes to POS tagging (which is my current challenge) you can
approach it like the above ("For the following sentences, select all the
nouns"), then re-assemble all the observations and perhaps use something like
triple judgements to look for disagreement. Or you could have an editor that
lets a user mark up the whole sentence (perhaps with the parts we can already
guess pre-filled from a previously trained model).  I am not sure the triple
judgement is necessary; maybe sentences labeled by a patchwork of different
people would still converge well in training.
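
For the triple-judgement idea, the re-assembly step could be as simple as a
majority vote per token, with full three-way disagreements routed back for
review instead of guessed. A minimal sketch (names hypothetical):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (names hypothetical): resolve three independent
// judgements per token by majority vote, and flag tokens where all
// three annotators disagree for review instead of guessing.
public class TripleJudgement {

    // Returns the majority label, or null when no label wins at least twice.
    static String resolve(List<String> judgements) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : judgements) {
            counts.merge(label, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= 2) {
                return e.getKey();
            }
        }
        return null; // full disagreement: route this token back to annotators
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("NN", "NN", "JJ"))); // NN
        System.out.println(resolve(List.of("NN", "VB", "JJ"))); // null -> review
    }
}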

Both can be assisted by some previously trained model to help keep people
awake and on track :-}  I think you mentioned in a prior mail that we can even
use the models that were built with proprietary data to bootstrap the
assistance process.
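
As a sketch of that bootstrapping step, OpenNLP's POSTaggerME can load a
previously trained model and pre-fill tag suggestions for the annotator to
confirm or correct; the model file name below is just an example:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PreAnnotate {
    public static void main(String[] args) throws IOException {
        // Load a previously trained model; the file name is just an example.
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(in);
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens = {"Steve", "Jobs", "founded", "Apple", "."};
            String[] guesses = tagger.tag(tokens);

            // Show the guesses as pre-filled suggestions for the annotator
            // to confirm or correct.
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "\t" + guesses[i]);
            }
        }
    }
}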

These are two ends of the spectrum: one assumes you are using people with
limited language skills, the other people who are potentially much more
competent.  With one you need to gather data from probably many more people;
with the other, far fewer.  Personally I like the crowd sourced approach, but
I wonder whether OpenNLP could find enough language "experts" per language
that it would make better sense to build a non web based app that is perhaps
a little more expedient to operate.

For giggles, assuming we needed to generate labels:
60k lines of text
average sentence length == 11 words
number of judgements == 3

That works out to 60,000 × 11 × 3 = 1,980,000, so we would be collecting
almost 2M judgements from people, which we would reassemble into our training
data after throwing out the bath water.

Maybe in the competent language expert case we only get each sentence judged
once, by a single person.  There is then perhaps no multiply-labeled sentence
to re-assemble, but we may want to keep people's judgements separate so we
can validate their work against others'.
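
One standard way to validate annotators against each other is Cohen's kappa
over the sentences both labeled, which corrects raw agreement for the
agreement expected by chance. A minimal sketch (names hypothetical):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (names hypothetical): compare two annotators on the
// sentences they both labeled using Cohen's kappa.
public class AnnotatorCheck {

    static double cohensKappa(List<String> tagsA, List<String> tagsB) {
        int n = tagsA.size();
        if (n == 0 || n != tagsB.size()) {
            throw new IllegalArgumentException("Need two aligned, non-empty tag sequences");
        }
        Map<String, Integer> countsA = new HashMap<>();
        Map<String, Integer> countsB = new HashMap<>();
        int observed = 0;
        for (int i = 0; i < n; i++) {
            countsA.merge(tagsA.get(i), 1, Integer::sum);
            countsB.merge(tagsB.get(i), 1, Integer::sum);
            if (tagsA.get(i).equals(tagsB.get(i))) {
                observed++;
            }
        }
        double po = (double) observed / n;  // observed agreement
        double pe = 0.0;                    // chance agreement
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            pe += (e.getValue() / (double) n)
                    * (countsB.getOrDefault(e.getKey(), 0) / (double) n);
        }
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        double kappa = cohensKappa(
                List.of("DT", "NN", "VBD", "DT", "NN"),
                List.of("DT", "NN", "VBD", "DT", "JJ"));
        System.out.printf("Cohen's kappa: %.2f%n", kappa); // 0.72
    }
}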

The data processing pipeline looks somewhat different in each case; the
competent POS labeler case simplifies it greatly.

I would love to help in whatever way I can, and can also find people to label
data at my own company's expense to help accelerate this.

Best

C

On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:

Hi all,

based on some discussion we had in the past I put together
a short proposal for a community based labeling project.

Here is the link:
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Any comments and opinions are very welcome.

Thanks,
Jörn


