Just one additional question regarding training data:

While preparing training data I am reevaluating all the time. Besides
cross-validation, I picked some example sentences and tried to find the
names of persons in them. At the moment this works pretty well: no names
are missed, there are just some false positives.

Is there a way to figure out why some spans were classified as names of
persons?
Alternatively: Is there a way of explicitly telling the model that some
parts of a sentence are names but not persons?
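
To make the question concrete, here is a minimal sketch of the kind of
check I mean. It assumes the gold and the predicted person spans are
available as (start, end) token offsets; the function name compare_spans
and the example sentence are hypothetical and not taken from any library.

```python
def compare_spans(gold, predicted):
    """Compare predicted person spans against gold spans (exact match only)."""
    gold_set = set(gold)
    pred_set = set(predicted)
    true_pos = gold_set & pred_set
    return {
        # Spans the model tagged as persons that are not in the gold data.
        "false_positives": sorted(pred_set - gold_set),
        # Gold person spans the model did not find.
        "false_negatives": sorted(gold_set - pred_set),
        "precision": len(true_pos) / len(pred_set) if pred_set else 0.0,
        "recall": len(true_pos) / len(gold_set) if gold_set else 0.0,
    }


# Hypothetical example: "Max Planck" is a name, but here it is part of an
# institute name and not a person.
tokens = "Yesterday Anna Schmidt visited the Max Planck Institute".split()
gold = [(1, 3)]               # "Anna Schmidt"
predicted = [(1, 3), (5, 7)]  # "Anna Schmidt", "Max Planck"

report = compare_spans(gold, predicted)
for start, end in report["false_positives"]:
    print(" ".join(tokens[start:end]))
```

Listing the false positives this way at least shows which kinds of spans
get confused with persons, even though it does not explain why the model
chose them.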

Thanks!

Em

On 03.10.2011 10:30, Em wrote:
> Hi Olivier,
> 
> thanks for your feedback!
> 
> What about document length?
> Just as an example: The production data will contain documents that are
> several pages long as well as very short texts containing only a few
> sentences.
> 
> I am thinking about chunking the long documents into smaller ones (i.e.
> each page of a longer document is split off into an individual doc).
> Does this make sense?
> 
> Regards,
> Em
> 
> On 03.10.2011 01:50, Olivier Grisel wrote:
>> 2011/10/3 Em <mailformailingli...@yahoo.de>:
>>> Hello list,
>>>
>>> I am currently trying to create a person-model for a specific domain for
>>> testing purposes.
>>> While the general suggestion is to have around 10k-15k sentences, I
>>> retrain and reevaluate the model on my training data while tagging new
>>> sentences.
>>>
>>> At the moment I am under 1k sentences. However, I asked myself whether
>>> it makes sense to include sentences without persons.
>>> While playing around I could not draw a clear conclusion: precision
>>> almost always increased when I included sentences without persons, while
>>> recall *sometimes* dropped a little.
>>>
>>> Is there a general direction for tagging training data?
>>>
>>> Btw.: This is the first time I am preparing training data. I never saw a
>>> complete training-dataset before.
>>
>> The general rule of thumb is to have a training set that looks as much
>> as possible like the data you will be applying your model to. So if
>> you will encounter sentences without names in your production data,
>> include a similar ratio in your training set.
>>
> 
