Pardon, I meant the word "my" ...

On 26 Aug 2016 20:49, "Damiano Porta" wrote:
> But I think it is the same, no? I mean, I will pass all the content as
> one sentence, so in this case the word "the" will be tagged the same.
>
> The problem in this case is that I need to create a tagger model too...
>
> On 26 Aug 2016 20:14, "Russ, Daniel (NIH/CIT) [E]" wrote:
>
>> The POSTaggerME uses tokenized sentences. In your example, both cases
>> have 2 sentences: sentence 1 = "My name is Damiano." and sentence 2 =
>> "My surname is Porta."
>>
>> POSTaggerME tagger=…
>> tagger.tag(new String[]{"My", "name", "is", "Damiano"});
>>
>> Daniel Russ, Ph.D.
>> Staff Scientist, Office of Intramural Research
>> Center for Information Technology
>> National Institutes of Health
>> U.S. Department of Health and Human Services
>> 12 South Drive
>> Bethesda, MD 20892-5624
>>
>> On Aug 26, 2016, at 1:46 PM, Damiano Porta wrote:
>>
>> Hmmm, why?
>> If I use the POS tagger for:
>> "My name is Damiano. My surname is Porta"
>>
>> OR separately:
>>
>> My name is Damiano.
>> My surname is Porta.
>>
>> I think the tags will be the same, no?
>>
>> On 26 Aug 2016 18:24, "Russ, Daniel (NIH/CIT) [E]" wrote:
>>
>> If you want to use the part of speech (from the POSTaggerME) as a feature,
>> you will need sentences.
>>
>> On Aug 26, 2016, at 12:15 PM, Damiano Porta wrote:
>>
>> Thanks Joern!
>> If I have understood you correctly...
>> If I do not need relations between sentences, I can skip the sentence
>> detection, right?
>>
>> On 26 Aug 2016 16:33, "Joern Kottmann" wrote:
>>
>> The name finder has the concept of "adaptive data" in the feature
>> generation. The feature generators can remember things from previous
>> sentences and use them to generate features. Usually that can
>> help with the recognition rate if you have names that are repeated. You
>> can tweak this to your data, or just pass in the entire document.
>>
>> Jörn
>>
>> On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta wrote:
>>
>> Hi!
>> Yes, I can train a good model (sure, it will take a lot of time); I have
>> 30k resumes, so the "data" isn't a problem.
>> I thought about many things. I am also creating a custom feature
>> generator, with a dictionary too (for names) and a regex for birthdays;
>> then the machine learning will look at their contexts.
>> So now I need to separate the sentences to create a custom model.
>> At this point I will not try with one CV per line.
>>
>> On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" wrote:
>>
>> Hi Damiano,
>> I am not sure that the NameFinder will be effective as-is for you. Do
>> you have training data (and I mean a lot of training data)? You need to
>> consider what features are useful in your case. You might consider a
>> feature such as line number on the page (since people tend to put their
>> name on the top or second line), or maybe the font size. You can add a
>> dictionary of common names and have a feature "inDictionary". You will
>> have to use your domain knowledge to help you here.
>>
>> For birthdays you may want to consider using a regex to pick out dates.
>> Then look at the context around the date (words before/after; drop the
>> date if "graduated" or another date appears just before it), or maybe at
>> the number of years before the present year (if you are looking at
>> resumes, you probably won't find any 5-year-olds or 200-year-olds).
>>
>> On Aug 26, 2016, at 5:57 AM, Damiano Porta wrote:
>>
>> Hi Daniel!
>>
>> Thank you so much for your opinion.
>> It makes perfect sense. But I am still a bit confused about the length
>> of the sentences.
>> In a resume there are many names, dates, etc. So my doubt is regarding
>> the structure of the sentences, because they sometimes follow specific
>> patterns.
>>
>> For example, I need to extract the personal name (who wrote the resume),
>> the birthday, etc.
>>
>> As you know, there are many names and dates inside a resume, so I thought
>> about writing the entire resume as one sentence to also train the
>> "position", more or less, of the entities. If I "decompose" the whole
>> resume into sentences I will lose this information, no?
>>
>> Damiano
>>
>> On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" wrote:
>>
>> Hi Damiano,
>>
>> Everyone can feel free to correct my ignorance, but I view the
>> name finder as follows.
>>
>> I look at it as walking down the sentence and classifying words as
>> "NOT IN NAME" until I hit the start of a name, then it is "START NAME",
>> followed by "STILL IN NAME" until "NOT IN NAME". Take the sentence "Did
>> John eat the stew".
>> Starting with the first word in the sentence, decide what the odds are
>> that it starts a person's name (given that it is the first word in the
>> sentence, happens to be "Did", and has a capital but is not all caps).
>> Then go to the next word in the sentence.
>> If the first word was not in a name, what are the odds that the second
>> word starts a name (given that the previous word did not start a name,
>> the word starts with a capital (but is not all capitals), the word is
>> "John", and the previous word is "Did")? If it decides that we are
>> starting a name at "John", we are now looking for the end. What are the
>> odds that "eat" is part of the name given that ["Did": was not part of
>> the name, was capitalized] and that ["John": was the first word in the
>> name, was capitalized]? You are essentially classifying [Did <- OTHER]
>> [John <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. If it was
>> "Did John Smith eat the stew", you would have [Did <- OTHER]
>> [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER]
>> [stew <- OTHER]. There are features other than just the word, the
>> previous word, and the shape (first letter capitalized, all letters
>> capitalized). I think the name finder uses part of speech also.
>>
>> So you see that it is not a name lookup table, but dependent on the
>> classification of words earlier in the sentence. Therefore, you
>> must have sentences. Does that help?
>> Daniel
>>
>> On Aug 25, 2016, at 9:55 AM, Damiano Porta wrote:
>>
>> Hello everybody!
>>
>> Could someone explain why I should separate each sentence of my documents
>> to train my models?
>> My documents are like resumes/CVs and the sentences can be very different.
>> For example, a sentence could also be:
>>
>> 1. Name: John
>> 2. Surname: Travolta
>>
>> Etc.
>> So my question is: what is the problem if I train my models
>> (namefinder, tokenizer) with the complete resume/CV, one per line?
>>
>> Could it be a problem?
>> In this case, when I want to tokenize the resume and do the NER, I
>> will simply pass the complete resume text, skipping the "sentence
>> detection" process.
>>
>> Thanks in advance for your opinion!
>>
>> Best
>> Damiano
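
The START / IN / OTHER labeling that Daniel describes in the thread can be illustrated with a small sketch. It is plain Java with no OpenNLP dependency; the class name NameLabelSketch and the method label are hypothetical, made up for illustration. It only shows how a tokenized sentence with known name spans maps to one label per token, which is the kind of sequence a name finder is trained to predict. It does not show how the probabilities are computed, and it assumes the spans are valid, non-overlapping [start, end) token ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NameLabelSketch {

    // Hypothetical helper: assigns START / IN / OTHER to each token,
    // given name spans as [start, end) token indices (assumed valid).
    static String[] label(String[] tokens, int[][] nameSpans) {
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "OTHER");
        for (int[] span : nameSpans) {
            labels[span[0]] = "START";            // first token of the name
            for (int i = span[0] + 1; i < span[1]; i++) {
                labels[i] = "IN";                 // remaining tokens of the name
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] tokens = {"Did", "John", "Smith", "eat", "the", "stew"};
        // "John Smith" covers tokens 1 and 2, i.e. the span [1, 3).
        String[] labels = label(tokens, new int[][] {{1, 3}});

        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            pairs.add("[" + tokens[i] + " <- " + labels[i] + "]");
        }
        // prints: [Did <- OTHER] [John <- START] [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]
        System.out.println(String.join(" ", pairs));
    }
}
```

Because each label depends on the labels and words before it in the same sequence, the boundaries of that sequence (the sentence) matter, which is the point made in the thread.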
