Hi Em,

thanks for sharing your experience. I can't give you any suggestions since I switched to another tool for named-entity recognition, called CRF++. It implements conditional random fields, which allows tagging multiple types of entities (e.g. location and organization) with one model. This particular tool is fairly low-level: you first have to convert your training data, as well as the sentences to be tagged, into a specific column format. Overall, though, it works really well for me even with a small number of annotated examples (200 sentences in my particular use case).
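In case it is useful: the conversion is mostly about producing CRF++'s column format. A rough sketch of a training file (the columns and the BIO tag set here are just my own choice, not anything CRF++ prescribes):

    Konrad    NNP    B-PER
    Zuse      NNP    I-PER
    worked    VBD    O
    in        IN     O
    Berlin    NNP    B-LOC
    .         .      O

One token per line, the last column is the label, and a blank line separates sentences. Training and tagging are then:

    crf_learn template train.data model
    crf_test -m model test.data

where "template" is a feature template file (see the CRF++ documentation for the %x[row,col] macros).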
Vyacheslav

On Dec 4, 2011, at 2:12 PM, Em wrote:

> Hi Vyacheslav,
>
> I started using OpenNLP in my free time. As I promised, I am sharing the
> outcomes of some tests, although I did not reach the 2,500 tagged
> sentences for time reasons.
>
> Unfortunately I didn't have the time to back up my outcomes with numbers,
> statistics or anything like that (since I just played around with data
> and OpenNLP).
>
> However, I saw a correlation between the length of the training data and
> the length of the data you want to tag with OpenNLP.
>
> I took several sub-passages of articles (let's call them the test set)
> from the training data and tried to tag them.
> The result of my test was: if your training data's average length is
> "long" and you test it with short test sets, precision and recall are
> relatively bad.
> For example, I took 10 sentences from my training data and tried to tag
> them, starting with just one sentence and going up to all 10.
> Additionally, I rewrote the sentences so that their meaning stayed the
> same to a human while looking completely different from the original
> sentences. The results were the same.
> If OpenNLP detected something in the one-sentence examples, it often was
> just one of the 2-3 entities. Some sentences remained untagged although
> they contained entities.
> However, if I combined several sentences with each other, OpenNLP became
> much more accurate (since together they formed a more typical document
> length than the one-sentence examples).
>
> Training data:
> My training data contained passages of wiki articles. The average length
> was several sentences per document. For my purposes, a document was
> everything that belongs to a specific headline in a Wikipedia article,
> grouped by the most specific headline.
> So if an H3 section contained several H4 sections, the content belonging
> to each H4 formed its own document.
> As a consequence there were also several longer documents, since H4
> headlines were rarely used.
>
> I decided to create a document for every most-specific headline because
> the wiki contained documents of every kind of length - some long enough
> to fill short books with 10 or more pages, others with just one sentence
> per article. Choosing the most-specific headline to form a document made
> it possible to have a good mix of document lengths.
>
> I would be glad to hear your feedback!
>
> Hope this helps,
> Em
>
>
> On 05.10.2011 23:50, Vyacheslav Zholudev wrote:
>> Hi Em,
>>
>> could you please share the outcome when you have some results? I would
>> be interested to hear them.
>>
>> Thanks,
>> Vyacheslav
>>
>> On Oct 5, 2011, at 11:08 PM, Em wrote:
>>
>>> Thanks Jörn!
>>>
>>> I'll experiment with this.
>>>
>>> Regards,
>>> Em
>>>
>>> On 05.10.2011 19:47, Jörn Kottmann wrote:
>>>> On 10/3/11 10:30 AM, Em wrote:
>>>>> What about the document's length?
>>>>> Just as an example: the production data will contain documents with
>>>>> a length of several pages as well as very short texts containing
>>>>> only a few sentences.
>>>>>
>>>>> I am thinking about chunking the long documents into smaller ones
>>>>> (i.e. a page of a longer document is split into an individual doc).
>>>>> Does this make sense?
>>>>
>>>> I would first try to process a long document at once. If you
>>>> encounter any issues, you could just call clearAdaptiveData before
>>>> the end of the document.
>>>> But as Olivier said, you might just want to include a couple of these
>>>> in your training data.
>>>>
>>>> Jörn
>>>>
>>
>> Best,
>> Vyacheslav
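P.S. In case it helps anyone reading along: what Jörn suggests above with
clearAdaptiveData would look roughly like the sketch below. This is only an
illustration of how I understand the OpenNLP API; the model path, the
placeholder documents, and the whitespace tokenization are made up for the
example.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    public class DocumentTagger {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained name finder model (example path).
            InputStream in = new FileInputStream("en-ner-person.bin");
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            in.close();

            // Each inner array stands for one document's sentences
            // (placeholder data).
            String[][] documents = {
                { "Pierre Vinken is 61 years old .",
                  "Mr. Vinken will join the board ." }
            };

            for (String[] document : documents) {
                for (String sentence : document) {
                    String[] tokens =
                        WhitespaceTokenizer.INSTANCE.tokenize(sentence);
                    Span[] names = finder.find(tokens);
                    for (Span name : names) {
                        System.out.println(Arrays.toString(Arrays.copyOfRange(
                            tokens, name.getStart(), name.getEnd())));
                    }
                }
                // Reset the adaptive context at each document boundary,
                // as Jörn described, instead of chunking the document.
                finder.clearAdaptiveData();
            }
        }
    }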