Hi Vyacheslav,

I started using OpenNLP in my free time. As promised, I am sharing the outcomes of some tests, although I did not reach the 2,500 tagged sentences due to time constraints.
Unfortunately I didn't have the time to back up my outcomes with numbers, statistics or anything like that, since I just played around with data and OpenNLP. However, I saw a correlation between the length of the training data and the length of the data you want to tag with OpenNLP. I took several sub-passages of articles from the training data (let's call them the test set) and tried to tag them. The result of my test was: if your training data's average document length is "long" and you test with short test sets, precision and recall are relatively bad.

For example, I took 10 sentences from my training data and tried to tag them, starting with just one sentence and going up to 10. Additionally, I rewrote the sentences so that their meaning would be the same to a human while looking completely different from the original sentences; the results were the same. If OpenNLP detected something in the one-sentence examples, it was often one of only 2-3 entities, and some sentences remained untouched although they contained entities. However, if I combined more than one sentence with each other, OpenNLP became much more accurate, since together they formed a more typical document length than the one-sentence examples.

Training data: my training data contained passages of wiki articles, with an average length of several sentences per document. To me, a document was everything that belongs to a specific headline in a Wikipedia article, grouped by the most specific headline. So if an H3 tag contained several H4 tags, the content belonging to each H4 tag formed a document. As a consequence there were several longer documents, since H4 tags were rarely used. I decided to create a document for every most-specific headline because the wiki contained articles of every kind of length - some could fill short books with 10 or more pages, others had just one sentence. Choosing the most specific headline to form a document made it possible to have a good mix of document lengths.
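In case it is useful, the grouping by most-specific headline can be sketched roughly like this. This is just a minimal sketch, not OpenNLP API: the Section class, the level numbering, and the choice to fold a parent headline's intro text into its first child's document are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class WikiDocumentBuilder {

    // One wiki section: headline level (2 = H2, 3 = H3, 4 = H4, ...)
    // plus the text that sits directly under that headline.
    static final class Section {
        final int level;
        final String text;
        Section(int level, String text) { this.level = level; this.text = text; }
    }

    // Group sections into training documents, one per most-specific
    // headline: a section closes a document unless the next section is
    // deeper, in which case its text is folded into the deeper section's
    // document (my assumption for intro text under a parent headline).
    static List<String> buildDocuments(List<Section> sections) {
        List<String> docs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < sections.size(); i++) {
            Section s = sections.get(i);
            if (current.length() > 0) current.append(' ');
            current.append(s.text);
            boolean nextIsDeeper =
                i + 1 < sections.size() && sections.get(i + 1).level > s.level;
            if (!nextIsDeeper) {          // most-specific headline reached
                docs.add(current.toString());
                current.setLength(0);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Section> sections = new ArrayList<>();
        sections.add(new Section(3, "H3 intro."));      // has H4 children
        sections.add(new Section(4, "First H4 content."));
        sections.add(new Section(4, "Second H4 content."));
        sections.add(new Section(3, "Plain H3 with no H4s."));
        for (String doc : buildDocuments(sections)) {
            System.out.println(doc);
        }
    }
}
```

With the sample sections above this yields three documents: the H3 intro merged with its first H4, the second H4 on its own, and the childless H3 on its own - which gives the mix of document lengths I mentioned.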
I would be glad to hear your feedback! Hope this helps,
Em

On 05.10.2011 23:50, Vyacheslav Zholudev wrote:
> Hi Em,
>
> could you please share the outcome when you have some results. I would be
> interested to hear them.
>
> Thanks,
> Vyacheslav
>
> On Oct 5, 2011, at 11:08 PM, Em wrote:
>
>> Thanks Jörn!
>>
>> I'll experiment with this.
>>
>> Regards,
>> Em
>>
>> On 05.10.2011 19:47, Jörn Kottmann wrote:
>>> On 10/3/11 10:30 AM, Em wrote:
>>>> What about document length?
>>>> Just as an example: the production data will contain documents with a
>>>> length of several pages as well as very short texts containing only a
>>>> few sentences.
>>>>
>>>> I think about chunking the long documents into smaller ones (i.e. a page
>>>> of a longer document is split into an individual doc). Does this
>>>> make sense?
>>>
>>> I would first try to process a long document at once. If you encounter any
>>> issues you could just call clearAdaptiveData before the end of the
>>> document.
>>> But as Olivier said, you might just want to include a couple of these in
>>> your training data.
>>>
>>> Jörn
>
> Best,
> Vyacheslav