Our default cutoff of 5 doesn't really work when
you just have 200 sentences of training data.
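If it helps, the cutoff can be lowered through TrainingParameters when
training the name finder. A rough, untested sketch (the exact train(...)
overloads and stream constructors differ between OpenNLP versions, and the
file names are placeholders):

import java.io.FileInputStream;
import java.nio.charset.Charset;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class LowCutoffTraining {
    public static void main(String[] args) throws Exception {
        // training file in the OpenNLP name finder format (placeholder name)
        ObjectStream<String> lines = new PlainTextByLineStream(
            new FileInputStream("wiki-ner.train"), Charset.forName("UTF-8"));
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "1");   // default is 5

        TokenNameFinderModel model = NameFinderME.train(
            "en", "organization", samples, params,
            (AdaptiveFeatureGenerator) null,
            Collections.<String, Object>emptyMap());

        // ... serialize the model with model.serialize(...) afterwards ...
    }
}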

Jörn

On 12/7/11 5:46 PM, Vyacheslav Zholudev wrote:
Hi Em,

thanks for sharing your experience.
I can't give you any suggestions since I switched to another tool for
named-entity recognition (called CRF++). It implements conditional random
fields, which make it possible to tag multiple types of entities, e.g.
location and organization. This particular tool is kinda low-level: you
have to convert your training data and the sentences to be tagged into a
specific format first, but overall it works really well for me, even on a
small number of annotated examples (200 sentences in my particular use case).
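
In case it helps to picture the format: CRF++ expects one token per line,
with whitespace-separated feature columns and the label in the last column,
and an empty line between sentences. A tiny made-up example (the POS column
is only an illustrative extra feature, not something CRF++ requires):

    Barack      NNP   B-PER
    Obama       NNP   I-PER
    visited     VBD   O
    Berlin      NNP   B-LOC
    .           .     O

    The         DT    O
    UN          NNP   B-ORG
    reacted     VBD   O
    .           .     O

As far as I recall, crf_test takes the same column layout and simply
appends the predicted label as an extra last column.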

Vyacheslav

On Dec 4, 2011, at 2:12 PM, Em wrote:

Hi Vyacheslav,

I started using OpenNLP in my free time. As promised, I am sharing the
outcomes of some tests - although I did not reach the 2,500 tagged
sentences for time reasons.

Unfortunately I didn't have the time to back up my outcomes with numbers,
statistics or anything like that (since I just played around with the data
and OpenNLP).

However, I saw a correlation between the length of the training data and
the length of the data you want to tag with OpenNLP.

I took several sub-passages of articles (let's call them the test set)
from the training data and tried to tag them.
The result of my test was: if your training data's average length is
"long" and you test with short test sets, precision and recall are
relatively bad.
For example, I took 10 sentences from my training data and tried to tag
them, starting with just one sentence and going up to 10.
Additionally, I rewrote the sentences so that their meaning stayed the
same for a human reader while looking completely different from the
original sentences. The results were the same.
If OpenNLP detected something in the one-sentence examples, it was often
only one of 2-3 entities. Some sentences remained untouched although they
contained entities.
However, if I combined several sentences, OpenNLP became much more
accurate (since together they formed a more typical document length than
the one-sentence examples).
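
To make the setup concrete, the per-sentence tagging boils down to
something like the following sketch (untested; the model file name and the
whitespace tokenization are only placeholders):

import java.io.FileInputStream;
import java.util.Arrays;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OneVsManySentences {
    public static void main(String[] args) throws Exception {
        TokenNameFinderModel model =
            new TokenNameFinderModel(new FileInputStream("wiki-ner.bin"));
        NameFinderME finder = new NameFinderME(model);

        String one = "ACME Corp. opened an office in Berlin .";
        String combined = one
            + " The company was founded in 1990 ."
            + " Its headquarters remain in Munich .";

        for (String text : new String[] { one, combined }) {
            String[] tokens = text.split(" ");   // naive tokenization
            Span[] names = finder.find(tokens);
            System.out.println(
                Arrays.toString(Span.spansToStrings(names, tokens)));
            finder.clearAdaptiveData();          // treat each run as its own document
        }
    }
}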

Training data:
My training data contained passages of wiki articles. The average length
was several sentences per document. To me, a document was everything that
belongs to a specific headline in a Wikipedia article, grouped by the
most specific headline.
So if an H3 tag contained several H4 tags, all content belonging to an
H4 tag formed a document.
As a consequence there were several longer documents, since H4 tags were
rarely used.

I decided to create a document for every most specific headline because
the wiki contained articles of every length - some could fill short books
of 10 or more pages, others had just one sentence. Choosing the most
specific headline to form a document made it possible to get a good mix
of document lengths.
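
For what it's worth, each such document is just a block in the usual
OpenNLP name finder training format: one sentence per line, entities
marked with <START:type> ... <END>, and an empty line between documents.
The sentences and labels below are made up, only to show the shape:

    The <START:organization> Apache Software Foundation <END> hosts many projects .
    It is incorporated in <START:location> Delaware <END> .

    <START:person> Pierre Vinken <END> joined the board in 1990 .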

I am glad to hear your feedback!

Hope this helps,
Em


On 10/5/11 11:50 PM, Vyacheslav Zholudev wrote:
Hi Em,

Could you please share the outcome when you have some results? I would
be interested to hear them.

Thanks,
Vyacheslav

On Oct 5, 2011, at 11:08 PM, Em wrote:

Thanks Jörn!

I'll experiment with this.

Regards,
Em

On 10/5/11 7:47 PM, Jörn Kottmann wrote:
On 10/3/11 10:30 AM, Em wrote:
What about document length?
Just as an example: the production data will contain documents that are
several pages long as well as very short texts containing only a few
sentences.

I am thinking about chunking the long documents into smaller ones (i.e. a
page of a longer document is split into an individual doc). Does this
make sense?
I would first try to process a long document at once. If you encounter
any issues, you could just call clearAdaptiveData before the end of the
document.
But as Olivier said, you might just want to include a couple of these in
your training data.
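
A minimal sketch of that pattern - one NameFinderME instance, adaptive
data cleared at document boundaries (untested; the model path and the
pre-tokenized input are placeholders):

import java.io.FileInputStream;
import java.util.List;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class DocumentBoundaryExample {
    // "documents" is a placeholder: each document is a list of pre-tokenized sentences.
    static void tagAll(List<List<String[]>> documents) throws Exception {
        TokenNameFinderModel model =
            new TokenNameFinderModel(new FileInputStream("my-model.bin"));
        NameFinderME finder = new NameFinderME(model);

        for (List<String[]> documentSentences : documents) {
            for (String[] sentenceTokens : documentSentences) {
                Span[] names = finder.find(sentenceTokens);
                // ... collect or evaluate the spans here ...
            }
            // reset the adaptive (previously seen names) data before the
            // next, unrelated document
            finder.clearAdaptiveData();
        }
    }
}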

Jörn

Best,
Vyacheslav




