Hi Damiano,

Once you decide not to use the NameFinder, there are many ways to attack this problem. I am taking a shot in the dark at features; you will need to think about this. Maybe for every token (separated by whitespace) you create a feature isDate, plus features for the previous 2 tokens and the next 2 tokens. Then train the classifier. Of course this will be a little harder, because you have to write all the code to read in the training data, which may not be trivial. Maybe your training example looks like:
<START:name>John Doe<END:name>
14 Maple Tree Court, Anytown MD 20000,
born: <START:birthday>Jan 1, 1850<END:birthday>
school:
Sept 1861-June 1864 College
experience
1901-present NIH-CIT
1864-1901 US Postal Service

For the birthday, you might see that birthdays often follow the word “born”, but the training needs to find that.

Don’t forget that if you have structured data, you can use it to help the classification at any step. For instance, if you already know the last name of the person, you can add a feature that checks whether a token is the last name. So in the example, if you know “Doe” is the last name, you could have a feature that checks whether the next word is “Doe”.

Daniel

Daniel Russ, Ph.D.
Staff Scientist, Office of Intramural Research
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda, MD 20892-5624

On Aug 26, 2016, at 5:57 AM, Damiano Porta <[email protected]> wrote:

Hi Daniel!

Thank you so much for your opinion. It makes perfect sense, but I am still a bit confused about the length of the sentences. In a resume there are many names, dates, and so on. My doubt is about the structure of the sentences, because they sometimes follow specific patterns. For example, I need to extract the personal name (of whoever wrote the resume), the birthday, etc. As you know, there are many names and dates inside a resume, so I thought about writing the entire resume as one sentence, so that the model also learns the approximate "position" of the entities. If I "decompose" the whole resume into sentences I will lose this information, no?

Damiano

On 25 Aug 2016, at 16:26, "Russ, Daniel (NIH/CIT) [E]" <[email protected]> wrote:

Hi Damiano,

Everyone can feel free to correct my ignorance, but I view the name finder as follows.
I look at it as walking down the sentence and classifying words as “NOT IN NAME” until I hit the start of a name, then it is “START NAME”, followed by “STILL IN NAME” until “NOT IN NAME”.

Take the sentence “Did John eat the stew”. Starting with the first word in the sentence, decide what the odds are that it starts a person’s name (given that it is the first word in the sentence, happens to be “Did”, and is capitalized but not all caps). Then go to the next word in the sentence. If the first word was not in a name, what are the odds that the second word starts a name (given that the previous word did not start a name, the word starts with a capital but is not all caps, the word is “John”, and the previous word is “Did”)? If it decides that we are starting a name at “John”, we are now looking for the end. What are the odds that “eat” is part of the name, given that [“Did”: was not part of the name, was capitalized] and that [“John”: was the first word in the name, was capitalized]?

You are essentially classifying
[Did <- OTHER] [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].
If it was “Did John Smith eat the stew”, you would have
[Did <- OTHER] [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].

There are features other than just the word, the previous word, and the shape (first letter capitalized, all letters capitalized). I think the name finder uses part of speech also.

So you see that it is not a name lookup table, but depends on the classification of words earlier in the sentence. Therefore, you must have sentences.

Does that help?

Daniel

On Aug 25, 2016, at 9:55 AM, Damiano Porta <[email protected]> wrote:

Hello everybody!
Could someone explain why I should separate each sentence of my documents to train my models? My documents are resumes/CVs and the sentences can be very different. For example, a sentence could also be:

1. Name: John
2. Surname: travolta
Etc.

So my question is: what is the problem if I train my models (namefinder, tokenizer) with the complete resume/CV, one per line? Could it be a problem? In that case, when I want to tokenize the resume and do the NER, I will simply pass the complete resume text, skipping the "sentence detection" step.

Thanks in advance for your opinion!

Best
Damiano
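
The per-token view discussed in this thread (START / IN / OTHER labels, plus features such as isDate, the two tokens on either side, and a known last name) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not how OpenNLP's name finder is actually implemented: the function names `label_tokens`, `looks_like_date` and `features` are hypothetical, the `<START:type>...<END:type>` markup simply follows the training example written in the thread, and the date check is a deliberately crude heuristic (it also fires on zip codes and on the word "may").

```python
import re

# Markup as written in the thread's example: <START:type>text<END:type>
TAG = re.compile(r"<START:(\w+)>(.*?)<END:\1>")

# Assumption: a tiny month list is enough for an illustrative isDate feature.
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun", "jul",
          "aug", "sep", "sept", "oct", "nov", "dec"}


def label_tokens(annotated):
    """Split annotated text on whitespace and attach a label to each token."""
    pairs, pos = [], 0
    for match in TAG.finditer(annotated):
        # Everything before the tagged span is outside any entity.
        pairs += [(tok, "OTHER") for tok in annotated[pos:match.start()].split()]
        kind = match.group(1)
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("START:" if i == 0 else "IN:") + kind))
        pos = match.end()
    pairs += [(tok, "OTHER") for tok in annotated[pos:].split()]
    return pairs


def looks_like_date(token):
    """Crude isDate check: a month abbreviation or a bare number."""
    core = token.strip(",.:;()").lower()
    return core in MONTHS or core.isdigit()


def features(tokens, i, previous_label, last_name=None):
    """Features for token i, including the label chosen for the previous token."""
    tok = tokens[i]
    feats = {
        "word": tok.lower(),
        "capitalized": tok[:1].isupper() and not tok.isupper(),
        "all_caps": tok.isupper(),
        "is_date": looks_like_date(tok),
        "previous_label": previous_label,
    }
    # Context window: previous 2 and next 2 tokens, padded at the edges.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    # Structured data known in advance, e.g. the applicant's last name.
    if last_name is not None:
        feats["is_last_name"] = tok == last_name
        feats["next_is_last_name"] = (
            i + 1 < len(tokens) and tokens[i + 1] == last_name
        )
    return feats


if __name__ == "__main__":
    sample = ("<START:name>John Doe<END:name> 14 Maple Tree Court, "
              "born: <START:birthday>Jan 1, 1850<END:birthday>")
    pairs = label_tokens(sample)
    tokens = [tok for tok, _ in pairs]
    previous = "OTHER"  # nothing precedes the first token
    for i, (tok, label) in enumerate(pairs):
        print(tok, label, features(tokens, i, previous, last_name="Doe"))
        previous = label
```

Each feature dictionary paired with its label would be one training event for whatever classifier is chosen. Because `previous_label` is itself a feature, the token order matters, which is the point made above about the name finder needing sentences (or some other fixed sequence unit) rather than a lookup table.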
