Hi Olivier, thanks for your feedback!

What about document length? As an example: the production data will
contain documents that are several pages long as well as very short
texts of only a few sentences. I am thinking about chunking the long
documents into smaller ones (i.e. each page of a longer document is
split off into an individual doc).

Does this make sense?

Regards,
Em

On 03.10.2011 01:50, Olivier Grisel wrote:
> 2011/10/3 Em <mailformailingli...@yahoo.de>:
>> Hello list,
>>
>> I am currently trying to create a person-model for a specific domain for
>> testing purposes.
>> While the general suggestion is to have around 10k-15k sentences, I
>> retrain and reevaluate the outcome of my trainingdata while tagging new
>> sentences.
>>
>> At the moment I am under 1k sentences. However I asked myself whether it
>> makes sense to include sentences without persons or not.
>> While playing around there was no clear conclusion to draw: Precision
>> almost always increased when I included sentences without persons while
>> *sometimes* recall dropped a little bit.
>>
>> Is there a general direction for tagging training data?
>>
>> Btw.: This is the first time I am preparing training data. I never saw a
>> complete training-dataset before.
>
> The general rule of thumb is to have a training set that looks as much
> as possible like the data you will be applying your model to. So if
> you will encounter sentences without names in your production data,
> include a similar ratio in your training set.
>
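
To make the quoted rule of thumb concrete, here is a minimal Python sketch of how the share of person-free sentences in a training set could be matched to an estimated production ratio. Everything in it is an assumption for illustration: the `<START:person>` marker as the annotation format, the helper names `has_person` and `match_empty_ratio`, and the idea of downsampling the person-free sentences rather than tagging more data.

```python
import random

# Assumed annotation marker; adjust to whatever your training format uses.
PERSON_TAG = "<START:person>"


def has_person(sentence):
    """Return True if the sentence contains at least one tagged person."""
    return PERSON_TAG in sentence


def match_empty_ratio(sentences, target_empty_ratio, seed=0):
    """Downsample person-free sentences so that their share of the result
    is close to target_empty_ratio. Original sentence order is preserved."""
    if not 0.0 <= target_empty_ratio < 1.0:
        raise ValueError("target_empty_ratio must be in [0, 1)")

    empty_idx = [i for i, s in enumerate(sentences) if not has_person(s)]
    n_tagged = len(sentences) - len(empty_idx)

    # Solve empties / (empties + n_tagged) == target for the number of empties.
    wanted = round(n_tagged * target_empty_ratio / (1.0 - target_empty_ratio))
    wanted = min(wanted, len(empty_idx))  # cannot keep more than we have

    rng = random.Random(seed)
    keep_empty = set(rng.sample(empty_idx, wanted))
    return [s for i, s in enumerate(sentences)
            if has_person(s) or i in keep_empty]
```

If roughly half of the production sentences are expected to contain no person, `match_empty_ratio(sentences, 0.5)` keeps all tagged sentences and about as many untagged ones. The target ratio itself has to be estimated from a sample of real production text; the sketch does not do that.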