Just one more question regarding training data: while preparing the training data I re-evaluate continuously. Besides cross-validation, I picked some example sentences and checked whether the person names within them are found. At the moment this works pretty well: no names are missed, there are just some false positives.
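For illustration, a minimal Python sketch of such a spot check. It assumes the hand-tagged sentences and the model output are both available as whitespace-tokenized text with `<START:person> ... <END>` markup; that format is an assumption, since the thread does not say which one is used. The helper names `parse_annotated` and `compare` and the example sentence are hypothetical. The sketch lists the false-positive spans and computes span-level precision and recall.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def parse_annotated(sentence: str) -> Tuple[List[str], Set[Span]]:
    """Split a whitespace-tokenized sentence with <START:type> ... <END>
    markup (assumed format) into plain tokens and a set of typed spans."""
    tokens: List[str] = []
    spans: Set[Span] = set()
    start, kind = None, None
    for piece in sentence.split():
        if piece.startswith("<START:") and piece.endswith(">"):
            start, kind = len(tokens), piece[len("<START:"):-1]
        elif piece == "<END>":
            if start is not None:
                spans.add((start, len(tokens), kind))
            start, kind = None, None
        else:
            tokens.append(piece)
    return tokens, spans


def compare(gold: Set[Span], predicted: Set[Span]):
    """Return false positives, false negatives, precision and recall,
    counting a span as correct only on an exact boundary and type match."""
    true_pos = gold & predicted
    false_pos = sorted(predicted - gold)
    false_neg = sorted(gold - predicted)
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    recall = len(true_pos) / len(gold) if gold else 0.0
    return false_pos, false_neg, precision, recall


# Hand-tagged sentence versus (hypothetical) model output for the same text.
gold_tokens, gold = parse_annotated(
    "<START:person> Anna Schmidt <END> visited the Max Planck Institute ."
)
pred_tokens, pred = parse_annotated(
    "<START:person> Anna Schmidt <END> visited the "
    "<START:person> Max Planck <END> Institute ."
)

fp, fn, precision, recall = compare(gold, pred)
for start, end, kind in fp:
    print("false positive:", kind, " ".join(pred_tokens[start:end]))
print("precision:", precision, "recall:", recall)
```

In this made-up example the only error is a name that is not a person in context ("Max Planck" inside an institute name), so recall stays at 1.0 while precision drops.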
Is there a way to find out why some spans were classified as names of persons?
Alternatively: is there a way to tell the model explicitly that some parts of a sentence are names, but not names of persons?

Thanks!
Em

On 03.10.2011 10:30, Em wrote:
> Hi Olivier,
>
> thanks for your feedback!
>
> What about document length?
> Just as an example: the production data will contain documents with a
> length of several pages as well as very short texts containing only a
> few sentences.
>
> I am thinking about chunking the long documents into smaller ones (i.e. a page
> of a longer document is split into an individual doc). Does this
> make sense?
>
> Regards,
> Em
>
> On 03.10.2011 01:50, Olivier Grisel wrote:
>> 2011/10/3 Em <mailformailingli...@yahoo.de>:
>>> Hello list,
>>>
>>> I am currently trying to create a person model for a specific domain for
>>> testing purposes.
>>> While the general suggestion is to have around 10k-15k sentences, I
>>> retrain and re-evaluate the outcome of my training data while tagging new
>>> sentences.
>>>
>>> At the moment I am under 1k sentences. However, I asked myself whether it
>>> makes sense to include sentences without persons or not.
>>> While playing around there was no clear conclusion to draw: precision
>>> almost always increased when I included sentences without persons, while
>>> *sometimes* recall dropped a little bit.
>>>
>>> Is there a general direction for tagging training data?
>>>
>>> Btw.: This is the first time I am preparing training data. I have never seen a
>>> complete training dataset before.
>>
>> The general rule of thumb is to have a training set that looks as much
>> as possible like the data you will be applying your model to. So if
>> you will encounter sentences without names in your production data,
>> include a similar ratio in your training set.