Hi Vyacheslav,

I started using OpenNLP in my free time. As promised, I am sharing the outcomes of some tests, although I did not reach the 2,500 tagged sentences due to time constraints.
Unfortunately I didn't have the time to back up my outcomes with numbers, statistics or anything like that, since I just played around with data and OpenNLP. However, I saw a correlation between the length of the training data and the length of the data you want to tag with OpenNLP. I took several sub-passages of articles from the training data (let's call them the test set) and tried to tag them. The result of my test was: if your training data's average document length is "long" and you test with short test sets, precision and recall are relatively bad.

For example, I took 10 sentences from my training data and tried to tag them, starting with just one sentence and going up to 10. Additionally, I rewrote the sentences so that their meaning would be the same to a human while looking completely different from the original sentences; the results were the same. If OpenNLP detected something in the one-sentence examples, it was often one of only 2-3 entities, and some sentences remained untouched although they contained entities. However, if I combined more than one sentence with each other, OpenNLP became much more accurate, since together they formed a more typical document length than the one-sentence examples.

Training data: my training data contained passages of wiki articles, with an average length of several sentences per document. To me, a document was everything that belongs to a specific headline in a Wikipedia article, grouped by the most specific headline. So if an H3 tag contained several H4 tags, the content belonging to each H4 tag formed a document. As a consequence there were several longer documents, since H4 tags were rarely used. I decided to create a document for every most-specific headline because the wiki contained articles of every kind of length - some could fill short books with 10 or more pages, others had just one sentence. Choosing the most specific headline to form a document made it possible to have a good mix of document lengths.
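In case it is useful, the grouping by most-specific headline can be sketched roughly like this. This is just a minimal sketch, not OpenNLP API: the Section class, the level numbering, and the choice to fold a parent headline's intro text into its first child's document are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class WikiDocumentBuilder {

    // One wiki section: headline level (2 = H2, 3 = H3, 4 = H4, ...)
    // plus the text that sits directly under that headline.
    static final class Section {
        final int level;
        final String text;
        Section(int level, String text) { this.level = level; this.text = text; }
    }

    // Group sections into training documents, one per most-specific
    // headline: a section closes a document unless the next section is
    // deeper, in which case its text is folded into the deeper section's
    // document (my assumption for intro text under a parent headline).
    static List<String> buildDocuments(List<Section> sections) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < sections.size(); i++) {
            Section s = sections.get(i);
            if (current.length() > 0) current.append(' ');
            current.append(s.text);
            boolean nextIsDeeper =
                i + 1 < sections.size() && sections.get(i + 1).level > s.level;
            if (!nextIsDeeper) {          // most-specific headline reached
                docs.add(current.toString());
                current.setLength(0);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Section> sections = new ArrayList<>();
        sections.add(new Section(3, "H3 intro."));      // has H4 children
        sections.add(new Section(4, "First H4 content."));
        sections.add(new Section(4, "Second H4 content."));
        sections.add(new Section(3, "Plain H3 with no H4s."));
        for (String doc : buildDocuments(sections)) {
            System.out.println(doc);
        }
    }
}
```

With the sample sections above this yields three documents: the H3 intro merged with its first H4, the second H4 on its own, and the childless H3 on its own - which gives the mix of document lengths I mentioned.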
I would be glad to hear your feedback! Hope this helps,
Em

On 05.10.2011 23:50, Vyacheslav Zholudev wrote:
> Hi Em,
>
> could you please share the outcome when you have some results. I would be
> interested to hear them.
>
> Thanks,
> Vyacheslav
>
> On Oct 5, 2011, at 11:08 PM, Em wrote:
>
>> Thanks Jörn!
>>
>> I'll experiment with this.
>>
>> Regards,
>> Em
>>
>> On 05.10.2011 19:47, Jörn Kottmann wrote:
>>> On 10/3/11 10:30 AM, Em wrote:
>>>> What about document length?
>>>> Just as an example: the production data will contain documents with a
>>>> length of several pages as well as very short texts containing only a
>>>> few sentences.
>>>>
>>>> I think about chunking the long documents into smaller ones (i.e. a page
>>>> of a longer document is split into an individual doc). Does this
>>>> make sense?
>>>
>>> I would first try to process a long document at once. If you encounter any
>>> issues you could just call clearAdaptiveData before the end of the
>>> document.
>>> But as Olivier said, you might just want to include a couple of these in
>>> your training data.
>>>
>>> Jörn
>
> Best,
> Vyacheslav