Hi Em,

thanks for sharing your experience. I can't give you any suggestions since I switched to another tool for named-entity recognition, called CRF++. It implements conditional random fields, which allows tagging multiple types of entities (e.g. location and organization) with one model. This particular tool is fairly low-level: you first have to convert your training data, as well as the sentences to be tagged, into a specific column format. Overall, though, it works really well for me even with a small number of annotated examples (200 sentences in my particular use case).
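In case it is useful: the conversion is mostly about producing CRF++'s column format. A rough sketch of a training file (the columns and the BIO tag set here are just my own choice, not anything CRF++ prescribes):

    Konrad    NNP    B-PER
    Zuse      NNP    I-PER
    worked    VBD    O
    in        IN     O
    Berlin    NNP    B-LOC
    .         .      O

One token per line, the last column is the label, and a blank line separates sentences. Training and tagging are then:

    crf_learn template train.data model
    crf_test -m model test.data

where "template" is a feature template file (see the CRF++ documentation for the %x[row,col] macros).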
Vyacheslav

On Dec 4, 2011, at 2:12 PM, Em wrote:

> Hi Vyacheslav,
>
> I started using OpenNLP in my free time. As I promised, I am sharing the
> outcomes of some tests, although I did not reach the 2,500 tagged
> sentences for time reasons.
>
> Unfortunately I didn't have the time to back up my outcomes with numbers,
> statistics or anything like that (since I just played around with data
> and OpenNLP).
>
> However, I saw a correlation between the length of the training data and
> the length of the data you want to tag with OpenNLP.
>
> I took several sub-passages of articles (let's call them the test set)
> from the training data and tried to tag them.
> The result of my test was: if your training data's average length is
> "long" and you test it with short test sets, precision and recall are
> relatively bad.
> For example, I took 10 sentences from my training data and tried to tag
> them, starting with just one sentence and going up to all 10.
> Additionally, I rewrote the sentences so that their meaning stayed the
> same to a human while looking completely different from the original
> sentences. The results were the same.
> If OpenNLP detected something in the one-sentence examples, it often was
> just one of the 2-3 entities. Some sentences remained untagged although
> they contained entities.
> However, if I combined several sentences with each other, OpenNLP became
> much more accurate (since together they formed a more typical document
> length than the one-sentence examples).
>
> Training data:
> My training data contained passages of wiki articles. The average length
> was several sentences per document. For my purposes, a document was
> everything that belongs to a specific headline in a Wikipedia article,
> grouped by the most specific headline.
> So if an H3 section contained several H4 sections, the content belonging
> to each H4 formed its own document.
> As a consequence there were also several longer documents, since H4
> headlines were rarely used.
>
> I decided to create a document for every most-specific headline because
> the wiki contained documents of every kind of length - some long enough
> to fill short books with 10 or more pages, others with just one sentence
> per article. Choosing the most-specific headline to form a document made
> it possible to have a good mix of document lengths.
>
> I would be glad to hear your feedback!
>
> Hope this helps,
> Em
>
>
> On 05.10.2011 23:50, Vyacheslav Zholudev wrote:
>> Hi Em,
>>
>> could you please share the outcome when you have some results? I would
>> be interested to hear them.
>>
>> Thanks,
>> Vyacheslav
>>
>> On Oct 5, 2011, at 11:08 PM, Em wrote:
>>
>>> Thanks Jörn!
>>>
>>> I'll experiment with this.
>>>
>>> Regards,
>>> Em
>>>
>>> On 05.10.2011 19:47, Jörn Kottmann wrote:
>>>> On 10/3/11 10:30 AM, Em wrote:
>>>>> What about the document's length?
>>>>> Just as an example: the production data will contain documents with
>>>>> a length of several pages as well as very short texts containing
>>>>> only a few sentences.
>>>>>
>>>>> I am thinking about chunking the long documents into smaller ones
>>>>> (i.e. a page of a longer document is split into an individual doc).
>>>>> Does this make sense?
>>>>
>>>> I would first try to process a long document at once. If you
>>>> encounter any issues, you could just call clearAdaptiveData before
>>>> the end of the document.
>>>> But as Olivier said, you might just want to include a couple of these
>>>> in your training data.
>>>>
>>>> Jörn
>>>>
>>
>> Best,
>> Vyacheslav
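P.S. In case it helps anyone reading along: what Jörn suggests above with
clearAdaptiveData would look roughly like the sketch below. This is only an
illustration of how I understand the OpenNLP API; the model path, the
placeholder documents, and the whitespace tokenization are made up for the
example.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    public class DocumentTagger {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained name finder model (example path).
            InputStream in = new FileInputStream("en-ner-person.bin");
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            in.close();

            // Each inner array stands for one document's sentences
            // (placeholder data).
            String[][] documents = {
                { "Pierre Vinken is 61 years old .",
                  "Mr. Vinken will join the board ." }
            };

            for (String[] document : documents) {
                for (String sentence : document) {
                    String[] tokens =
                        WhitespaceTokenizer.INSTANCE.tokenize(sentence);
                    Span[] names = finder.find(tokens);
                    for (Span name : names) {
                        System.out.println(Arrays.toString(Arrays.copyOfRange(
                            tokens, name.getStart(), name.getEnd())));
                    }
                }
                // Reset the adaptive context at each document boundary,
                // as Jörn described, instead of chunking the document.
                finder.clearAdaptiveData();
            }
        }
    }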