Can you post the piece of code where you use the API so we can check if it
is OK?

To create an analyzer that can process texts, you should use the
SentenceDetector, the Tokenizer, and finally the Name Finder.
Make sure you are creating the data structures correctly. Refer to the
documentation to learn the input and output of each module of the pipeline.
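For reference, here is a rough sketch of how the three components fit
together with the 1.5-style API. The model file names (en-sent.bin,
en-token.bin, en-ner-drugs.bin) are only placeholders for whatever models
you are actually loading:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NerPipeline {
    public static void main(String[] args) throws Exception {
        // Load the models (file names are placeholders).
        InputStream sentIn = new FileInputStream("en-sent.bin");
        InputStream tokIn  = new FileInputStream("en-token.bin");
        InputStream nerIn  = new FileInputStream("en-ner-drugs.bin");

        SentenceDetectorME sentenceDetector =
            new SentenceDetectorME(new SentenceModel(sentIn));
        TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
        NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(nerIn));

        String text = "The patient was given folic acid and aspirin.";

        // 1. Split the raw text into sentences.
        for (String sentence : sentenceDetector.sentDetect(text)) {
            // 2. Tokenize each sentence; the name finder expects a String[] of tokens.
            String[] tokens = tokenizer.tokenize(sentence);

            // 3. Find names; each Span holds token indices, not character offsets.
            Span[] names = nameFinder.find(tokens);
            for (Span name : names) {
                StringBuilder sb = new StringBuilder();
                for (int i = name.getStart(); i < name.getEnd(); i++) {
                    sb.append(tokens[i]).append(' ');
                }
                System.out.println(name.getType() + ": " + sb.toString().trim());
            }
        }

        // Clear adaptive data when you move on to a new document.
        nameFinder.clearAdaptiveData();

        sentIn.close();
        tokIn.close();
        nerIn.close();
    }
}

Note that NameFinderME.find() takes the tokens of one sentence at a time,
so the detokenized/untokenized text never reaches the name finder directly.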

Yes, you should try to use the sentence detector and tokenizer models
distributed by OpenNLP.
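
If I remember the tool names correctly, you can also run the whole chain
from the command line to compare against your program (model and file
names below are just examples):

bin/opennlp SentenceDetector en-sent.bin < input.txt > sentences.txt
bin/opennlp TokenizerME en-token.bin < sentences.txt > tokens.txt
bin/opennlp TokenNameFinder your-drug-model.bin < tokens.txt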


On Fri, Feb 10, 2012 at 7:53 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

> On 09/02/12 09:21, Joern Kottmann wrote:
>
>> In the NER training format all tokens are separated by white spaces.
>> So you always need a space between two tokens.
>>
>> Try our command line tokenizer; it will output whitespace-tokenized text.
>>
> Hi there Joern,
>
> I followed your suggestion and it turns out that by using the command line
> tool I am indeed able to recognise multi-word entities!!! This is all I've
> been hoping for, honestly...
>
> HOWEVER, I was not able to reproduce it in my program no matter what!!!
> You see,
>
>  * I devoted a whole day to properly merging 3 papers so I have some
>   training material
>  * I used the command line tool to train a nameFinder with that training
>   data.
>  * I then used the command line tool again, passing it the model I
>   just trained and 4 sentences that contained some obvious drug names
>  * and it came back with correct annotations, which means it did
>   recognise everything, even folic acid!!! This is good news...
>  * I then tried to load that same model in my program and pass it the
>   exact same sentences, but now it comes back with single-word
>   entities. 2 of them may be part of the actual ones ("folic" is part of
>   "folic acid"), but why am I not getting the full names?
>
> Now, you said last time that I should not be using the pre-trained maxent
> English tokenizer because it does not add spaces to the tokens. I have
> tried surrounding the tokens with spaces before the nameFinder sees them,
> but that is obviously wrong because I'm not getting anything back at all!!!
> Not even the single-word entities or just "folic"... Also, what you said
> made me wonder...
>
> If tokenization is so important between training and runtime, then what
> tokenizer does the command line tool use? Does it not use the pre-trained
> English one when you try to query a model with some sentences? If entities
> in the training data for the nameFinder HAVE TO be surrounded by spaces,
> wouldn't it make sense for the pre-trained tokenizer to do the same? And
> vice versa... Since the English tokenizer does not include spaces, would it
> not make more sense to NOT have spaces in the training data as well, so the
> two can co-operate?
>
> Using the command line tool I did not have to do any tokenization, so
> presumably it happens internally...
> What tokenizer is being used?
> Why can I not reproduce the results from my program even though I'm using
> the right model?
> Just as a reference, I shall mention that I'm using a buffered reader to
> read the lines from the file that contains the 4 sentences. The 4 sentences
> just happen to be on separate lines, but the sentence detector can cope with
> that...
>
> Thanks in advance,
> Jim
>
>
