Thanks Joern! If I have understood you correctly: if I do not need relations between sentences, I can skip sentence detection, right?
On 26 Aug 2016 16:33, "Joern Kottmann" <[email protected]> wrote:

> The name finder has the concept of "adaptive data" in the feature
> generation. The feature generators can remember things from previous
> sentences and use them to generate features. Usually that can
> help with the recognition rate if you have names that are repeated. You
> can tweak this to your data, or just pass in the entire document.
>
> Jörn
>
> On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta <[email protected]>
> wrote:
>
> > Hi!
> > Yes, I can train a good model (sure, it will take a lot of time); I have
> > 30k resumes, so the "data" isn't a problem.
> > I have thought about many things. I am also creating a custom feature
> > generator, with a dictionary (for names) and a regex for birthdays; then
> > the machine learning will look at their contexts.
> > So now I need to separate the sentences to create a custom model.
> > At this point I will not try with one CV per line.
> >
> > On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" <[email protected]>
> > wrote:
> >
> > Hi Damiano,
> > I am not sure that the NameFinder will be effective as-is for you. Do
> > you have training data (and I mean a lot of training data)? You need to
> > consider what features are useful in your case. You might consider a
> > feature such as the line number on the page (since people tend to put their
> > name on the top or second line), or maybe the font size. You can add a
> > dictionary of common names and have a feature “inDictionary”. You will
> > have to use your domain knowledge to help you here.
> >
> > For birthdays you may want to consider using a regex to pick out dates.
> > Then look at the context around the date (words before/after; discard the
> > date if "graduated" appears or if another date comes just before) or maybe
> > the number of years before the present year (if you are looking at resumes,
> > you probably won’t find any 5-year-olds or 200-year-olds).
> >
> > Daniel Russ, Ph.D.
> > Staff Scientist, Office of Intramural Research
> > Center for Information Technology
> > National Institutes of Health
> > U.S. Department of Health and Human Services
> > 12 South Drive
> > Bethesda, MD 20892-5624
> >
> > On Aug 26, 2016, at 5:57 AM, Damiano Porta <[email protected]> wrote:
> >
> > Hi Daniel!
> >
> > Thank you so much for your opinion.
> > It makes perfect sense, but I am still a bit confused about the length
> > of the sentences.
> > In a resume there are many names, dates, etc. So my doubt concerns
> > the structure of the sentences, because they sometimes follow specific
> > patterns.
> >
> > For example, I need to extract the personal name (of whoever wrote the
> > resume), the birthday, etc.
> >
> > As you know, there are many names and dates inside a resume, so I thought
> > about writing the entire resume as one sentence, to also train on the
> > approximate "position" of the entities. If I "decompose" the whole resume
> > into sentences I will lose this information, no?
> >
> > Damiano
> >
> > On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <[email protected]>
> > wrote:
> >
> > Hi Damiano,
> >
> > Everyone can feel free to correct my ignorance, but I view the
> > name finder as follows.
> >
> > I look at it as walking down the sentence and classifying words as
> > “NOT IN NAME” until I hit the start of a name, at which point it is
> > “START NAME”, followed by “STILL IN NAME” until “NOT IN NAME”. Take the
> > sentence “Did John eat the stew”. Starting with the first word in the
> > sentence, decide what the odds are that it starts a person’s name (given
> > that the first word of the sentence happens to be “Did”, with a capital
> > but not all caps). Then go to the next word in the sentence.
> > If the first word was not in a name, what are the odds that the second
> > word starts a name (given that the previous word did not start a name, the
> > word starts with a capital (but is not all capitals), the word is “John”,
> > and the previous word is “Did”)? If it decides that we are starting a name
> > at “John”, we are now looking for the end. What are the odds that “eat” is
> > part of the name given that [“Did”: was not part of the name, was
> > capitalized] and that [“John”: was the first word in the name, was
> > capitalized]? You are essentially classifying [Did <- OTHER]
> > [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. If it were
> > “Did John Smith eat the stew”, you would have [Did <- OTHER]
> > [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER].
> > There are features other than just the word, the previous word, and the
> > shape (first letter capitalized, all letters capitalized). I think the name
> > finder uses part of speech also.
> >
> > So you see that it is not a name lookup table, but depends on the
> > classification of words earlier in the sentence. Therefore, you
> > must have sentences. Does that help?
> > Daniel
> >
> > Daniel Russ, Ph.D.
> > Staff Scientist, Office of Intramural Research
> > Center for Information Technology
> > National Institutes of Health
> > U.S. Department of Health and Human Services
> > 12 South Drive
> > Bethesda, MD 20892-5624
> >
> > On Aug 25, 2016, at 9:55 AM, Damiano Porta <[email protected]> wrote:
> >
> > Hello everybody!
> >
> > Could someone explain why I should separate each sentence of my documents
> > to train my models?
> > My documents are resumes/CVs, and the sentences can be very different.
> > For example, a sentence could also be:
> >
> > 1. Name: John
> > 2. Surname: travolta
> >
> > Etc.
> > So my question is:
> > What is the problem if I train my models
> > (namefinder, tokenizer) with the complete resume/CV, one per line?
> >
> > Could it be a problem?
> > In that case, when I want to tokenize the resume and do the NER, I
> > will simply pass the complete resume text, skipping the "sentence
> > detection" step.
> >
> > Thanks in advance for your opinion!
> >
> > Best
> > Damiano
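
To make the START / IN / OTHER walk-through from Daniel's message concrete, here is a minimal Python sketch of the decoding step only: it turns one tag per token into name spans. It is not OpenNLP code and does not reproduce how the name finder scores tags; the function name `tags_to_spans` and the tag strings are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical tag names, mirroring the labels used in the message above.
START, IN, OTHER = "START", "IN", "OTHER"


def tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert per-token tags into (begin, end) token spans, end exclusive."""
    spans = []
    begin = None  # index where the currently open name started, if any
    for i, tag in enumerate(tags):
        if tag == START:
            # A new name begins; close any name that was still open.
            if begin is not None:
                spans.append((begin, i))
            begin = i
        elif tag == IN and begin is not None:
            # Still inside the open name, nothing to do.
            continue
        else:
            # OTHER (or a stray IN with no open name) closes an open name.
            if begin is not None:
                spans.append((begin, i))
                begin = None
    if begin is not None:
        spans.append((begin, len(tags)))
    return spans


tokens = ["Did", "John", "Smith", "eat", "the", "stew"]
tags = [OTHER, START, IN, OTHER, OTHER, OTHER]
names = [" ".join(tokens[b:e]) for b, e in tags_to_spans(tags)]
print(names)  # ['John Smith']
```

Because each tag decision is conditioned on the tags chosen for earlier tokens in the same sequence, what you feed in as one "sentence" determines how much context the classifier sees.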

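
Daniel's birthday suggestion (a regex for dates, then a check on the surrounding words and on how plausible the year is) could be sketched roughly as below. This is only an illustration under stated assumptions: dates are taken to look like DD/MM/YYYY or DD-MM-YYYY, the age limits and the context window size are arbitrary, and `birthday_candidates` is a hypothetical helper, not part of OpenNLP.

```python
import re
from typing import List

# Assumption: dates are written as DD/MM/YYYY or DD-MM-YYYY.
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def birthday_candidates(text: str, current_year: int,
                        min_age: int = 16, max_age: int = 80,
                        window: int = 20) -> List[str]:
    """Return date strings that could plausibly be a birthday."""
    candidates = []
    for match in DATE_RE.finditer(text):
        age = current_year - int(match.group(3))
        # Resumes are unlikely to come from 5-year-olds or 200-year-olds.
        if not (min_age <= age <= max_age):
            continue
        # Look at the characters just before the date for a disqualifying cue.
        context = text[max(0, match.start() - window):match.start()].lower()
        if "graduated" in context:
            continue
        candidates.append(match.group(0))
    return candidates


resume = ("Name: John Travolta\nBorn: 18/02/1954\n"
          "Graduated: 01/06/1976\nUpdated 26/08/2016")
print(birthday_candidates(resume, current_year=2016))  # ['18/02/1954']
```

In a custom feature generator, the same checks could be exposed as features (for example "plausible birth year", "preceded by graduated") and left to the model to weigh, rather than applied as hard filters.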