Re: How does good training data look like?

Em Mon, 03 Oct 2011 01:30:46 -0700

Hi Olivier,

thanks for your feedback!


What about document length?
For example, the production data will contain documents that are
several pages long as well as very short texts of only a few
sentences.

I am thinking about chunking the long documents into smaller ones (e.g.
each page of a longer document is split off into an individual doc).
Does this make sense?
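
A minimal sketch of the chunking idea, assuming each document is already
split into a list of sentences. The function name `chunk_sentences` and
the chunk size of 20 sentences are placeholders for illustration, not
something any particular toolkit prescribes:

```python
def chunk_sentences(sentences, max_sentences=20):
    """Split a list of sentences into consecutive chunks.

    Every chunk holds at most `max_sentences` sentences; a document
    shorter than that comes back as a single chunk.
    """
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    return [
        sentences[i:i + max_sentences]
        for i in range(0, len(sentences), max_sentences)
    ]


# Hypothetical long document of 45 sentences.
long_doc = [f"Sentence {n}." for n in range(1, 46)]
chunks = chunk_sentences(long_doc, max_sentences=20)
print([len(c) for c in chunks])  # chunk sizes: [20, 20, 5]
```

Short texts pass through unchanged, so long and short documents end up
as units of comparable size.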

Regards,
Em

On 03.10.2011 01:50, Olivier Grisel wrote:
> 2011/10/3 Em <mailformailingli...@yahoo.de>:
>> Hello list,
>>
>> I am currently trying to create a person-model for a specific domain for
>> testing purposes.
>> While the general suggestion is to have around 10k-15k sentences, I
>> retrain and reevaluate the outcome of my training data while tagging new
>> sentences.
>>
>> At the moment I am under 1k sentences. However I asked myself whether it
>> makes sense to include sentences without persons or not.
>> While playing around there was no clear conclusion to draw: Precision
>> almost always increased when I included sentences without persons while
>> *sometimes* recall dropped a little bit.
>>
>> Is there a general direction for tagging training data?
>>
>> Btw.: This is the first time I am preparing training data. I never saw a
>> complete training-dataset before.
> 
> The general rule of thumb is to have a training set that looks as much
> as possible like the data you will be applying your model to. So if
> you will encounter sentences without names in your production data,
> include a similar ratio in your training set.
>
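
The rule of thumb above can be sketched as a small sampling step. This is
only an illustration under assumptions: the tagged sentences are already
separated into those with and those without a name, the share of
name-free sentences in production (`target_negative_ratio`) is an
estimate you supply, and the function name is hypothetical:

```python
import random


def match_negative_ratio(with_names, without_names,
                         target_negative_ratio, seed=0):
    """Build a training set whose share of name-free sentences
    approximates `target_negative_ratio`.

    All sentences with names are kept; name-free sentences are
    subsampled. If too few name-free sentences are available, all of
    them are used and the ratio ends up below the target.
    """
    if not 0 <= target_negative_ratio < 1:
        raise ValueError("target_negative_ratio must be in [0, 1)")
    # n_neg / (n_pos + n_neg) = r  =>  n_neg = r * n_pos / (1 - r)
    wanted = round(
        target_negative_ratio * len(with_names) / (1 - target_negative_ratio)
    )
    wanted = min(wanted, len(without_names))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(list(without_names), wanted)
    return list(with_names) + sampled
```

For example, with 60 sentences that contain names and an estimated 40%
of name-free sentences in production, this keeps 40 name-free sentences
for a training set of 100.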
