Hi Daniel! Thank you so much for your opinion. It makes perfect sense. But I am still a bit confused about the length of the sentences. A resume contains many names, dates, etc., so my question is about the structure of the sentences, because they sometimes follow specific patterns.
For example, I need to extract the personal name (of whoever wrote the resume), the birthday, etc. As you know, there are many names and dates inside a resume, so I thought about writing the entire resume as one sentence in order to also train, more or less, the "position" of the entities. If I "decompose" the whole resume into sentences, I will lose this information. No?

Damiano

On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <[email protected]> wrote:

> Hi Damiano,
>
> Everyone can feel free to correct my ignorance, but I view the name
> finder as follows.
>
> I look at it as walking down the sentence and classifying words as
> “NOT IN NAME” until I hit the start of a name, then it is “START NAME”,
> followed by “STILL IN NAME” until “NOT IN NAME”. Take the sentence “Did
> John eat the stew”. Starting with the first word in the sentence, decide
> what the odds are that it starts a person’s name (given that it is the
> first word in the sentence, happens to be “Did”, and has a capital but is
> not all caps). Then go to the next word in the sentence. If the first
> word was not in a name, what are the odds that the second word starts a
> name (given that the previous word did not start a name, the word starts
> with a capital (but is not all capitals), the word is “John”, and the
> previous word is “Did”)? If it decides that we are starting a name at
> “John”, we are now looking for the end. What are the odds that “eat” is
> part of the name, given that [“Did”: was not part of the name, was
> capitalized] and that [“John”: was the first word in the name, was
> capitalized]? You are essentially classifying [Did <- OTHER]
> [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. If it was
> “Did John Smith eat the stew”, you would have [Did <- OTHER]
> [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER]
> [stew <- OTHER]. There are features other than just the word, the
> previous word, and the shape (first letter capitalized, all letters
> capitalized).
> I think the name finder also uses part of speech.
>
> So you see that it is not a name lookup table, but depends on the
> classification of words earlier in the sentence. Therefore, you
> must have sentences. Does that help?
> Daniel
>
> Daniel Russ, Ph.D.
> Staff Scientist, Office of Intramural Research
> Center for Information Technology
> National Institutes of Health
> U.S. Department of Health and Human Services
> 12 South Drive
> Bethesda, MD 20892-5624
>
> On Aug 25, 2016, at 9:55 AM, Damiano Porta <[email protected]> wrote:
>
> Hello everybody!
>
> Could someone explain why I should separate each sentence of my documents
> to train my models?
> My documents are resumes/CVs, and the sentences can be very different.
> For example, a sentence could also be:
>
> 1. Name: John
> 2. Surname: travolta
>
> etc.
> So my question is: what is the problem if I train my models
> (namefinder, tokenizer) with the complete resume/CV, one per line?
>
> Could it be a problem?
> In that case, when I want to tokenize a resume and run NER, I
> would simply pass the complete resume text, skipping the "sentence
> detection" step.
>
> Thanks in advance for your opinion!
>
> Best
> Damiano
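
The walk down the sentence that Daniel describes (tag each word as OTHER, START or IN, using the previous decision as one of the features) can be sketched in a few lines of Python. This is only a toy illustration and not how OpenNLP is implemented: a hand-written rule stands in for the trained model, and every name here (`COMMON_SENTENCE_STARTERS`, `shape`, `classify`, `tag_sentence`) is hypothetical.

```python
# Toy illustration only: a hand-written rule stands in for the trained
# statistical model. All names below are hypothetical.

# Assumed list of words that often open a sentence without being a name.
COMMON_SENTENCE_STARTERS = {"did", "does", "is", "the", "what"}


def shape(word):
    """Coarse word shape: all caps, initial capital, or lowercase."""
    if word.isupper():
        return "ALLCAPS"
    if word[:1].isupper():
        return "CAP"
    return "lower"


def classify(word, position, prev_tag):
    """Tag one token from its own features and the previous decision."""
    capitalized = shape(word) == "CAP"
    starter = position == 0 and word.lower() in COMMON_SENTENCE_STARTERS
    if capitalized and not starter:
        # A capitalized word right after a name word continues the name.
        return "IN" if prev_tag in ("START", "IN") else "START"
    return "OTHER"


def tag_sentence(tokens):
    """Walk left to right; each decision feeds into the next one."""
    tagged = []
    prev_tag = "OTHER"
    for position, token in enumerate(tokens):
        prev_tag = classify(token, position, prev_tag)
        tagged.append((token, prev_tag))
    return tagged


result = tag_sentence("Did John Smith eat the stew".split())
# [('Did', 'OTHER'), ('John', 'START'), ('Smith', 'IN'),
#  ('eat', 'OTHER'), ('the', 'OTHER'), ('stew', 'OTHER')]
```

Because `prev_tag` is reset at the start of every call, the sentence boundary is itself a feature in this sketch: feed it a whole resume as one "sentence" and the first-word rule would only ever apply once.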
