Hi Damiano,

     Everyone can feel free to correct my ignorance, but I view the name 
finder as follows.

     I look at it as walking down the sentence and classifying each word as 
“NOT IN NAME” until I hit the start of a name, which is “START NAME”, followed 
by “STILL IN NAME” until I am back to “NOT IN NAME”.  Take the sentence “Did 
John eat the stew”.  Starting with the first word in the sentence, decide what 
are the odds that it starts a person’s name (given that it is the first word 
in the sentence, happens to be “Did”, and starts with a capital but is not all 
caps).  Then go to the next word in the sentence.  If the first word was not in 
a name, what are the odds that the second word starts a name (given that the 
previous word did not start a name, the word starts with a capital (but is not 
all capitals), the word is “John”, and the previous word is “Did”)?  If it 
decides that we are starting a name at “John”, we are now looking for the end.  
What are the odds that “eat” is part of the name, given that [“Did”: was not 
part of the name, was capitalized] and that [“John”: was the first word in the 
name, was capitalized]?  You are essentially classifying [Did <- OTHER] [John 
<- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].  If it was “Did John 
Smith eat the stew”, you would have [Did <- OTHER] [John <- START] [Smith <- IN] 
[eat <- OTHER] [the <- OTHER] [stew <- OTHER].  There are features other than 
just the word, the previous word, and the shape (first letter capitalized, all 
letters capitalized).  I think the name finder also uses part of speech.
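
To make the walk-through above concrete, here is a minimal sketch in Python. 
It is not OpenNLP code: the function names (shape, context_features, 
names_from_outcomes) and the feature set are my own assumptions for 
illustration, and the outcomes are written by hand rather than predicted by a 
trained model.  It only shows that the context for each word includes the 
outcome already assigned to the previous word, and how START/IN/OTHER 
outcomes turn into names.

```python
from typing import Dict, List

# Outcome labels as used in the explanation above.
START, IN, OTHER = "START", "IN", "OTHER"


def shape(word: str) -> str:
    """Coarse word shape: all caps, initial capital, or anything else."""
    if word.isupper():
        return "ALL_CAPS"
    if word[:1].isupper():
        return "INIT_CAP"
    return "LOWER"


def context_features(tokens: List[str], i: int,
                     previous_outcomes: List[str]) -> Dict[str, str]:
    """Hypothetical context for token i.

    The features depend on the previous word AND on the outcome already
    chosen for it, which is why the classifier needs to know where a
    sentence begins.
    """
    at_start = i == 0
    return {
        "word": tokens[i],
        "shape": shape(tokens[i]),
        "prev_word": "<SENTENCE_START>" if at_start else tokens[i - 1],
        "prev_outcome": OTHER if at_start else previous_outcomes[i - 1],
    }


def names_from_outcomes(tokens: List[str], outcomes: List[str]) -> List[str]:
    """Collect each START word plus the IN words that follow it."""
    names: List[str] = []
    current: List[str] = []
    for token, outcome in zip(tokens, outcomes):
        if outcome == START:
            if current:
                names.append(" ".join(current))
            current = [token]
        elif outcome == IN and current:
            current.append(token)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


tokens = ["Did", "John", "Smith", "eat", "the", "stew"]
outcomes = [OTHER, START, IN, OTHER, OTHER, OTHER]  # hand-labeled example

print(context_features(tokens, 2, outcomes))
print(names_from_outcomes(tokens, outcomes))
```

For “Smith” the context carries prev_word “John” and prev_outcome START, so 
the decision for one word rests on the decisions made earlier in the same 
sentence.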


    So you see that it is not a name lookup table; each decision depends on 
how the words earlier in the sentence were classified.  Therefore, you must 
have sentences. Does that help?
Daniel


Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Aug 25, 2016, at 9:55 AM, Damiano Porta wrote:

Hello everybody!

Could someone explain why I should separate each sentence of my documents
to train my models?
My documents are like resumes/CVs and the sentences can be very different.
For example, a sentence could also be:

1. Name: John
2. Surname: travolta

Etc etc
So my question is: what is the problem if I train my models
(namefinder, tokenizer) with the complete resume/CV, one per line?

Could it be a problem?
In this case, when I want to tokenize the resume and do the NER, I
will simply pass the complete resume text, skipping the "sentence detection"
process.

Thanks for your opinion in advance!

Best
Damiano
