Hi Eldad,

Sorry for the late response.

1) Yes, I have had similar success and failure with the NameFinder.
Hopefully, we can come up with better training data.

The training data format for the NameFinder is simple. Basically, the
NameFinder expects that the document has already been processed with the
Sentence Detector and the Tokenizer, though that isn't strictly required
if you are training for your own applications. Say you wanted to use the
"Hi James," below. Although it is not a complete sentence, you would put
it on a separate line, with the tokenizer producing "Hi James ," ...
notice the space between "James" and the ','. The NameFinder expects the
data tokenized and tagged as follows: "Hi <START:person> James <END> ,"
... notice the <START> and <END> tags in the sentence, or partial
sentence in this case. The older models used just <START> and <END>
without the qualifier specifying the type of tag.

We've also found that names preceded by a "Mr" or "Mrs" prefix seem to
be recognized more easily. Most of the training has been done on news
articles and not everyday text.

Jorn just started a project, which the group has been discussing for a
while, that involves collecting and parsing freely available data for
the corpus.
https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Please feel free to join the discussion and help with the tasks. We are
trying to provide open training sets to help with customization and with
the issues around using copyrighted material for the models.

James

On 6/5/2011 6:52 AM, Eldad Yamin wrote:
> Hi James,
>
> Thank you for your great response!
>
> 1. I already used the command (as described in the documentation) and got
> some nice results.
>
> The only problem that I've found is with the NameFinder: it didn't
> recognize various names.
>
> Can you please explain how I can use the trainer to make it recognize
> more names (people, places, etc.)?
>
>
> 2.
> Linked documents, in other words related articles, for example (in
> GATE):
>
> http://gate.ac.uk/biz/customers.html
>
> Read the first paragraph under "media".
>
>
>
> 3. In addition, I have access to lots of texts/books written in Hebrew.
> How can I use them to train the NameFinder (I will contribute it back)?
>
> And again, thank you very much!
>
> On Sun, Jun 5, 2011 at 2:04 AM, James Kosin <[email protected]> wrote:
>
>> Eldad,
>>
>> It is possible.
>> 1) This is easy enough with the current architecture and models.
>> Basically, you have to pass in the document or paragraphs and parse into
>> sentences using the SentenceDetector, which detects the sentences in the
>> paragraph and returns a String array of sentences. Next, the output from
>> the sentence detector needs to be put through the Tokenizer, which takes
>> the sentences and tokenizes them into smaller parts. Usually words, but it
>> also moves punctuation away from the words as well. This is done for
>> each sentence and returns a string list of tokens. From here you have
>> the raw data needed for most of the other models. From your
>> description, you will want to use the NameFinder and the supporting
>> models to tag the people, locations, organizations and the like.
>>
>> 2) Not sure what you mean by link documents to others....
>>
>> 3) We don't support all languages at the moment. Mostly because
>> training and test data need to be collected over many months and parsed
>> to be trained. Many groups have already done some work; unfortunately,
>> most is copyrighted and difficult for everyone to get in some cases.
>>
>> This should get you started.
>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
>>
>> Download the release here... Don't forget the models toward the bottom.
>> http://incubator.apache.org/opennlp/download.cgi
>>
>> Let us know if you need anything else.
>>
>> James
>>
>>
>> On 6/4/2011 12:30 PM, Eldad Yamin wrote:
>>> Hello everyone,
>>> After researching NLP, I have found OpenNLP to be one of the most
>>> promising solutions at the moment.
>>> However, I'm still looking for instructions on how to make OpenNLP fit
>>> my needs.
>>>
>>> I need OpenNLP to:
>>> 1. take a sentence/paragraph as input and return IE, annotation, named
>>> entities (people, locations, organizations) and (numbers, dates, etc.).
>>> 2. link documents to others.
>>> 3. support multiple languages.
>>>
>>> Please advise,
>>> Eldad.
>>>
>>
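
For reference, the training-line format James describes at the top of this thread ("Hi <START:person> James <END> ,") can be sketched in a few lines of Python. This is only an illustration and not OpenNLP code: `tokenize` is a naive regex stand-in for the OpenNLP Tokenizer, and `annotate` with its (start, end, type) span tuples is a hypothetical helper made up for this example.

```python
import re


def tokenize(text):
    """Naive stand-in for a real tokenizer (assumption, not OpenNLP's):
    splits runs of word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(tokens, spans):
    """Build one NameFinder-style training line.

    spans is a list of (start, end, type) token indices, end exclusive
    (a hypothetical convention chosen for this sketch). Spans are
    assumed not to overlap.
    """
    starts = {start: kind for start, _end, kind in spans}
    ends = {end for _start, end, _kind in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    # Close a span that runs to the end of the line.
    if len(tokens) in ends:
        out.append("<END>")
    return " ".join(out)


tokens = tokenize("Hi James,")
print(" ".join(tokens))
print(annotate(tokens, [(1, 2, "person")]))
```

The first printed line is the tokenized text, with the comma separated from the name; the second is the same line with the person tags added. A real training file would hold one such line per sentence, produced with the actual Sentence Detector and Tokenizer rather than this regex.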
