Hi Damiano,
   I am not sure that the NameFinder will be effective as-is for you.  Do you 
have training data (and I mean a lot of training data)?  You need to consider 
which features are useful in your case.  You might consider a feature such as 
the line number on the page (since people tend to put their name on the first 
or second line), or maybe the font size.  You can add a dictionary of common 
names and have a feature “inDictionary”.  You will have to use your domain 
knowledge to help you here.
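
A minimal sketch of the kind of features described above, written in Python only for brevity. The function names, the feature names, and the toy name dictionary are all hypothetical; none of this is the OpenNLP API, and font size is left out because plain text does not carry it.

```python
# Hypothetical feature extractor; names are illustrative, not OpenNLP's.
COMMON_NAMES = {"john", "mary", "damiano", "daniel"}  # toy dictionary (assumption)


def token_features(token, line_index, common_names=COMMON_NAMES):
    """Return a dict of simple features for one token of a resume."""
    return {
        "lineNumber": line_index,                       # 0 = top line of the page
        "nearTop": line_index < 2,                      # first or second line
        "inDictionary": token.lower() in common_names,  # lookup in the name list
        # Initial capital, but not an all-caps token.
        "initCap": token[:1].isupper() and not token.isupper(),
    }


def resume_features(text):
    """Yield (token, features) pairs for every whitespace-separated token."""
    for line_index, line in enumerate(text.splitlines()):
        for token in line.split():
            yield token, token_features(token, line_index)
```

Each dict could then be turned into whatever feature representation your training tool expects; how that is wired into the name finder is not shown here.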

  For the birthday, you may want to use a regex to pick out dates.  Then 
look at the context around each date (the words before and after it; discard 
the date if “graduated” or another date appears just before it) and at how 
many years it lies before the present year (if you are looking at resumes, 
you probably won’t find any 5-year-olds or 200-year-olds).
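
A rough sketch of that idea in Python. The date pattern (day/month/year with a four-digit year), the list of "negative" context words, the age limits, and the function name are all assumptions made for illustration; real resumes will need more patterns.

```python
import re
from datetime import date

# Matches dates such as 26/08/1985 or 26-08-1985 (this format is an assumption).
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")
# Words that suggest the date is not a birthday (illustrative list).
NEGATIVE_CONTEXT = {"graduated", "graduation", "degree"}


def birthday_candidates(text, today=None, min_age=16, max_age=100, window=3):
    """Return the date strings that could plausibly be a birthday."""
    today = today or date.today()
    candidates = []
    for match in DATE_RE.finditer(text):
        age = today.year - int(match.group(3))
        if not (min_age <= age <= max_age):
            continue  # implausible age for someone writing a resume
        # Inspect the last few alphabetic words before the date.
        before = re.findall(r"[A-Za-z]+", text[:match.start()])[-window:]
        if any(word.lower() in NEGATIVE_CONTEXT for word in before):
            continue  # e.g. "Graduated 15/06/1998"
        candidates.append(match.group(0))
    return candidates
```

For example, with `today=date(2016, 8, 26)` the text `"Born: 12/03/1985. Graduated 15/06/1998. Updated 01/02/2016."` keeps only `12/03/1985`: the second date follows "Graduated" and the third gives an age of zero.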

Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda,  MD 20892-5624

On Aug 26, 2016, at 5:57 AM, Damiano Porta wrote:

Hi Daniel!

Thank you so much for your opinion.
It makes perfect sense. But I am still a bit confused about the length of
the sentences.
In a resume there are many names, dates, etc. So my doubt concerns the
structure of the sentences, because they sometimes follow specific
patterns.

For example, I need to extract the personal name (of whoever wrote the
resume), the birthday, etc.

As you know, there are many names and dates inside a resume, so I thought
about writing the entire resume as a single sentence, so that the model
also learns the approximate "position" of the entities. If I "decompose"
the whole resume into sentences I will lose this information, no?

Damiano

On 25 Aug 2016 at 16:26, "Russ, Daniel (NIH/CIT) [E]" wrote:

Hi Damiano,

    Everyone can feel free to correct my ignorance, but I view the
name finder as follows.

    I look at it as walking down the sentence and classifying words as
“NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
followed by “STILL IN NAME” until “NOT IN NAME” again.  Take the sentence
“Did John eat the stew”.  Starting with the first word in the sentence,
decide what the odds are that it starts a person’s name (given that it is
the first word in the sentence, that it happens to be “Did”, and that it
has an initial capital but is not all caps).  Then go to the next word in
the sentence.  If the first word was not in a name, what are the odds that
the second word starts a name (given that the previous word did not start
a name, the word starts with a capital (but is not all capitals), the word
is “John”, and the previous word is “Did”)?  If it decides that we are
starting a name at “John”, we are now looking for the end.  What are the
odds that “eat” is part of the name given that [“Did”: was not part of the
name, was capitalized] and that [“John”: was the first word in the name,
was capitalized]?  You are essentially classifying
[Did <- OTHER] [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].
If it was “Did John Smith eat the stew”, you would have
[Did <- OTHER] [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER]
[stew <- OTHER].  There are features other than just the word, the previous
word, and the shape (first letter capitalized, all letters capitalized).
I think the name finder also uses part of speech.
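
To make the walk-through concrete, here is a toy Python sketch of the left-to-right tagging. The `classify` function is a hand-written rule standing in for the trained model (an assumption for illustration only); the point is just that each decision receives the tag chosen for the previous word.

```python
def is_init_cap(word):
    """True for words like 'John': initial capital, but not all caps."""
    return word[:1].isupper() and not word.isupper()


def classify(word, index, prev_tag):
    """Toy stand-in for the trained classifier: pick a tag for one word."""
    if index == 0 or not is_init_cap(word):
        # A capital on the first word of a sentence is weak evidence,
        # so this toy rule never starts a name there.
        return "OTHER"
    # A capitalized word continues a name if the previous word was in one,
    # otherwise it starts a new name.
    return "IN" if prev_tag in ("START", "IN") else "START"


def tag_sentence(words):
    """Tag words left to right, feeding each decision into the next one."""
    tags = []
    prev_tag = "OTHER"
    for index, word in enumerate(words):
        prev_tag = classify(word, index, prev_tag)
        tags.append(prev_tag)
    return list(zip(words, tags))
```

Calling `tag_sentence("Did John Smith eat the stew".split())` tags "John" as START and "Smith" as IN, with every other word OTHER. A real model would weigh many features and probabilities instead of one fixed rule.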


   So you see that it is not a name lookup table, but dependent on the
previous classification of words earlier in the sentence.  Therefore, you
must have sentences. Does that help?
Daniel



On Aug 25, 2016, at 9:55 AM, Damiano Porta wrote:

Hello everybody!

Could someone explain why I should separate each sentence of my documents
to train my models?
My documents are resumes/CVs, and the sentences can be very different.
For example, a sentence could also be:

1. Name: John
2. Surname: travolta

Etc.
So my question is: what is the problem if I train my models
(namefinder, tokenizer) with the complete resume/CV, one per line?

Could it be a problem?
In this case, when I want to tokenize a resume and run NER on it, I
will simply pass the complete resume text, skipping the "sentence detection"
step.

Thanks for your opinion in advance!

Best
Damiano
