Hi Eldad,

Sorry for the late response.

1)  Yes, I have seen similar successes and failures with the NameFinder.
Hopefully, we can come up with better training data.  The training data
format for the NameFinder is simple: it expects that the document has
already been run through the Sentence Detector and the Tokenizer, though
that isn't strictly required if you are training models for your own
applications.

Say you wanted to use the "Hi James," below.  Although it is not a
complete sentence, it would go on its own line, with the tokenizer
producing "Hi James ," (notice the space between "James" and the ",").
The NameFinder expects the training data tokenized and tagged as follows:
"Hi <START:person> James <END> ," (notice the <START:person> and <END>
tags around the name within the sentence, or partial sentence in this
case).  The older models used just <START> and <END>, without the
qualifier specifying the type of tag.
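
To make the format concrete, here is a rough sketch in plain Java that
builds such a training line from tokens that are already split.  The
class and method names are made up for illustration and are not part of
OpenNLP; the name span is assumed to be given as token indexes with an
exclusive end.

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not an OpenNLP class.
final class NameSampleFormatter {

    // Wraps tokens[start..end) in <START:type> ... <END> and joins all
    // tokens with single spaces, one sentence per output line.
    // Assumption of this sketch: 'end' is exclusive.
    static String annotate(String[] tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("invalid name span");
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) {
                line.add("<START:" + type + ">");
            }
            line.add(tokens[i]);
            if (i == end - 1) {
                line.add("<END>");
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Tokens as the tokenizer would produce them for "Hi James,"
        String[] tokens = {"Hi", "James", ","};
        System.out.println(annotate(tokens, 1, 2, "person"));
    }
}
```

Running main prints the same line as in the example above.  A real
training file would simply contain many such lines, one tokenized
sentence per line.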

We've also found that prefixing a name with a title such as "Mr" or
"Mrs" seems to make it easier to recognize.  Most of the training has
been done on news articles, not everyday text.

Jorn just started a project, which the group has been discussing for some
time, that involves collecting and parsing freely available data for the
corpus.   https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
Please feel free to join the discussion and help with the tasks.  We are
trying to provide open training sets to help with customization and to
avoid the issues that come with using copyrighted material for the
models.

James


On 6/5/2011 6:52 AM, Eldad Yamin wrote:
> Hi James,
>
> Thank you for your great response!
>
> 1. I already used the command (as described in the documentation) and got
> some nice results.
>
> The only problem that I've found is with the NameFinder: it didn't
> recognize various names.
>
> Can you please explain how I can use the trainer to make it recognize
> more names (people, places, etc.)?
>
>
> 2. Linked documents, in other words related articles, for example (in
> GATE):
>
> http://gate.ac.uk/biz/customers.html
>
> read the first paragraph under "media"
>
>
>
> 3. In addition, I have access to lots of texts/books written in Hebrew.
> How can I use them to train the NameFinder (I will contribute it back)?
>
> And again, thank you very much!
>
> On Sun, Jun 5, 2011 at 2:04 AM, James Kosin <[email protected]> wrote:
>
>> Eldad,
>>
>> It is possible.
>> 1)  This is easy enough with the current architecture and models.
>> Basically, you have to pass in the document or paragraphs and parse into
>> sentences using the SentenceDetector, which detects the sentences in the
>> paragraph and returns a String array of sentences.  Next the output from
>> the sentence detector needs to be put through the Tokenizer, which takes
>> the sentences and tokenizes into smaller parts.  Usually words, but it
>> also moves punctuation away from the words as well.  This is done for
>> each sentence and returns a string list of tokens.   From here you have
>> the raw data needed for most of the other models.  From your
>> description, you will want to use the NameFinder and the supporting
>> models to tag the people, locations, and organizations and the like.
>>
>> 2)  Not sure what you mean by link documents to others....
>>
>> 3)  We don't yet support all languages at the moment.  Mostly because
>> training and test data need to be collected over many months and parsed
>> to be trained.  Many groups have already done some work; unfortunately,
>> most is copyrighted and difficult for everyone to get in some cases.
>>
>> This should get you started.
>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
>>
>> Download the release here...  Don't forget the models toward the bottom.
>> http://incubator.apache.org/opennlp/download.cgi
>>
>> Let us know if you need anything else.
>>
>> James
>>
>>
>> On 6/4/2011 12:30 PM, Eldad Yamin wrote:
>>> Hello everyone,
>>> After researching NLP, I have found OpenNLP to be one of the most
>>> promising solutions at the moment.
>>> However, I'm still looking for instructions on how to make OpenNLP fit
>>> my needs.
>>>
>>> I need the OpenNLP to:
>>> 1. take a sentence/paragraph as input and return IE, annotations, named
>>> entities (people, locations, organizations) and (numbers, dates, etc.).
>>> 2. to use the OpenNLP to link documents to others.
>>> 3. to support multi languages.
>>>
>>> Please advise,
>>> Eldad.
>>>
>>
