Re: Is sentence detection process really needed?

Damiano Porta Fri, 26 Aug 2016 13:03:41 -0700

Pardon, I meant the word "my"...

On 26 Aug 2016 20:49, "Damiano Porta" <[email protected]> wrote:


> But I think it is the same, no? I mean, I will pass all the content as
> one sentence. So in this case the word "the" will be tagged the same.
>
> The problem in this case is that I need to create a tagger model too...
>
> On 26 Aug 2016 20:14, "Russ, Daniel (NIH/CIT) [E]" <[email protected]>
> wrote:
>
>> The POSTaggerME uses tokenized sentences. In your example, both cases
>> have 2 sentences: sentence 1 = "My name is Damiano." and sentence 2 =
>> "My surname is Porta."
>>
>> POSTaggerME tagger=…
>> tagger.tag(new String[]{"My", "name", "is", "Damiano"});
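The point about per-sentence token arrays can be sketched without OpenNLP itself. The following is a minimal illustration that assumes the JDK's `BreakIterator` as a stand-in for a trained sentence detector and a naive punctuation-splitting tokenizer; the class and method names are hypothetical and are not part of any library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitSketch {

    // Split text into sentences with the JDK's BreakIterator.
    // This is only a stand-in for a trained sentence detector.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Naive tokenizer: put spaces around sentence punctuation, then split on whitespace.
    static String[] tokens(String sentence) {
        return sentence.replaceAll("([.,!?])", " $1 ").trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String s : sentences("My name is Damiano. My surname is Porta.")) {
            // Each array is what would be handed to a tagger, one sentence at a time.
            System.out.println(String.join("|", tokens(s)));
        }
    }
}
```

Each printed line corresponds to one array that would be passed to a single `tag` call, rather than one array for the whole document.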
>>
>> Daniel Russ, Ph.D.
>> Staff Scientist, Office of Intramural Research
>> Center for Information Technology
>> National Institutes of Health
>> U.S. Department of Health and Human Services
>> 12 South Drive
>> Bethesda,  MD 20892-5624
>>
>> On Aug 26, 2016, at 1:46 PM, Damiano Porta <[email protected]> wrote:
>>
>> Hmm, why?
>> If I use the POS tagger on:
>> "My name is Damiano. My surname is Porta"
>>
>> or on the sentences separately:
>>
>> My name is Damiano.
>> My surname is Porta.
>>
>> I think the tags will be the same, no?
>>
>> On 26 Aug 2016 18:24, "Russ, Daniel (NIH/CIT) [E]" <[email protected]> wrote:
>>
>> If you want to use the part of speech (from the POSTaggerME) as a feature,
>> you will need sentences.
>>
>> On Aug 26, 2016, at 12:15 PM, Damiano Porta <[email protected]> wrote:
>>
>> Thanks Joern!
>> If I have understood you correctly:
>> if I do not need relations between sentences, I can skip sentence
>> detection, right?
>>
>> On 26 Aug 2016 16:33, "Joern Kottmann" <[email protected]> wrote:
>>
>> The name finder has the concept of "adaptive data" in the feature
>> generation. The feature generators can remember things from previous
>> sentences and use them to generate features. Usually that can help
>> with the recognition rate if you have names that are repeated. You
>> can tweak this to your data, or just pass in the entire document.
>>
>> Jörn
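As a rough illustration of the "adaptive data" idea, the sketch below keeps a per-document memory of how each token was last tagged and exposes it as a feature for later sentences. It is a simplified, hypothetical class, not OpenNLP's implementation; the feature names `w=` and `prev=` are assumptions made for the example. In OpenNLP itself, `NameFinderME.clearAdaptiveData()` is the call that resets this kind of memory between documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdaptiveDataSketch {

    // Remembers the last outcome assigned to each token within one document.
    private final Map<String, String> previousOutcome = new HashMap<>();

    // Features for a token: the word itself plus what it was tagged as earlier, if anything.
    List<String> features(String token) {
        List<String> f = new ArrayList<>();
        f.add("w=" + token);
        f.add("prev=" + previousOutcome.getOrDefault(token, "none"));
        return f;
    }

    // Called after each sentence has been tagged.
    void update(String[] tokens, String[] outcomes) {
        for (int i = 0; i < tokens.length; i++) {
            previousOutcome.put(tokens[i], outcomes[i]);
        }
    }

    // Called at a document boundary so one resume does not influence the next.
    void clear() {
        previousOutcome.clear();
    }

    public static void main(String[] args) {
        AdaptiveDataSketch memory = new AdaptiveDataSketch();
        System.out.println(memory.features("Porta"));
        memory.update(new String[] {"Damiano", "Porta"}, new String[] {"START", "IN"});
        System.out.println(memory.features("Porta"));
        memory.clear();
        System.out.println(memory.features("Porta"));
    }
}
```

The memory only helps if sentences from the same document are processed in order and the memory is cleared between documents, which is why sentence boundaries and document boundaries both matter.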
>>
>> On Fri, Aug 26, 2016 at 3:25 PM, Damiano Porta <[email protected]> wrote:
>>
>> Hi!
>> Yes, I can train a good model (sure, it will take a lot of time); I have
>> 30k resumes, so the "data" isn't a problem.
>> I have thought about many things. I am also creating a custom feature
>> generator, with a dictionary (for names) and a regex for the birthday; then
>> the machine learning will look at their contexts.
>> So now I need to separate the sentences to create a custom model.
>> At this point I will not try with one CV per line.
>>
>> On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" <[email protected]> wrote:
>>
>> Hi Damiano,
>> I am not sure that the NameFinder will be effective as-is for you. Do
>> you have training data (and I mean a lot of training data)? You need to
>> consider which features are useful in your case. You might consider a
>> feature such as the line number on the page (since people tend to put their
>> name on the top or second line), or maybe the font size. You can add a
>> dictionary of common names and have a feature "inDictionary". You will
>> have to use your domain knowledge to help you here.
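The two features suggested above (line position and dictionary membership) could be produced along these lines. This is a minimal sketch with a hypothetical class, a tiny hard-coded name set, and made-up feature strings; a real dictionary would be loaded from a file and the line-number buckets would need tuning on actual resumes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ResumeFeatureSketch {

    // Hypothetical dictionary of common first names; a real one would be loaded from a file.
    static final Set<String> COMMON_NAMES = Set.of("john", "maria", "damiano");

    // Build string features for one token, given the 0-based line it appears on.
    static List<String> features(String token, int lineNumber) {
        List<String> f = new ArrayList<>();
        String lower = token.toLowerCase(Locale.ROOT);
        f.add("w=" + lower);
        if (COMMON_NAMES.contains(lower)) {
            f.add("inDictionary");
        }
        // Bucket the line number so the model sees "near the top" rather than exact values.
        f.add(lineNumber < 2 ? "line=top" : lineNumber < 10 ? "line=upper" : "line=rest");
        return f;
    }

    public static void main(String[] args) {
        System.out.println(features("John", 0));
        System.out.println(features("Java", 25));
    }
}
```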
>>
>> For the birthday, you may want to consider using a regex to pick out dates.
>> Then look at the context around the date (words before/after; rule it out
>> if "graduated" appears or if another date comes just before) or maybe the
>> number of years before the present year (if you are looking at resumes, you
>> probably won't find any 5-year-olds or 200-year-olds).
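A small sketch of the regex-plus-context idea for birthdays follows. It assumes only the `dd/mm/yyyy` and `dd-mm-yyyy` date shapes, an arbitrary 30-character left context window, and an arbitrary plausible-age range of 16 to 100; all of these, and the class name, are illustrative assumptions rather than recommended values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BirthdayCandidateSketch {

    // Matches dates such as 12/03/1985 or 12-03-1985; other formats are ignored here.
    static final Pattern DATE = Pattern.compile("\\b(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4})\\b");

    // Return dates whose year gives a plausible age and that are not
    // preceded by the word "graduated" in the nearby left context.
    static List<String> candidates(String text, int currentYear) {
        List<String> out = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            int age = currentYear - Integer.parseInt(m.group(3));
            String before = text.substring(Math.max(0, m.start() - 30), m.start())
                    .toLowerCase(Locale.ROOT);
            if (age >= 16 && age <= 100 && !before.contains("graduated")) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String resume = "Born: 12/03/1985\nGraduated on 15/07/2008\nUpdated 01/02/2016";
        System.out.println(candidates(resume, 2016));
    }
}
```

In a trained model these checks would more naturally become features (for example "age plausible" or "graduated nearby") than hard filters, so the classifier can weigh them against the surrounding words.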
>>
>> On Aug 26, 2016, at 5:57 AM, Damiano Porta <[email protected]> wrote:
>>
>> Hi Daniel!
>>
>> Thank you so much for your opinion.
>> It makes perfect sense. But I am still a bit confused about the length
>> of the sentences.
>> In a resume there are many names, dates, etc. So my doubt concerns
>> the structure of the sentences, because they sometimes follow specific
>> patterns.
>>
>> For example, I need to extract the personal name (of whoever wrote the
>> resume), the birthday, etc.
>>
>> As you know, there are many names and dates inside a resume, so I thought
>> about writing the entire resume as one sentence, to also train on the
>> approximate "position" of the entities. If I "decompose" the whole resume
>> into sentences I will lose this information, no?
>>
>> Damiano
>>
>> On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <[email protected]> wrote:
>>
>> Hi Damiano,
>>
>> Everyone can feel free to correct my ignorance, but I view the
>> name finder as follows.
>>
>> I look at it as walking down the sentence and classifying words as
>> "NOT IN NAME" until I hit the start of a name, then it is "START NAME",
>> followed by "STILL IN NAME" until "NOT IN NAME". Take the sentence "Did
>> John eat the stew". Starting with the first word in the sentence, decide
>> what the odds are that the first word starts a person's name (given that
>> it is the first word in a sentence, happens to be "Did", and has a capital
>> but is not all caps). Then go to the next word in the sentence.
>> If the first word was not in a name, what are the odds that the second
>> word starts a name (given that the previous word did not start a name, the
>> word starts with a capital (but is not all capitals), the word is "John",
>> and the previous word is "Did")? If it decides that we are starting a name
>> at "John", we are now looking for the end. What are the odds that "eat" is
>> part of the name given that ["Did": was not part of the name, was
>> capitalized] and that ["John": was the first word in the name, was
>> capitalized]? You are essentially classifying [Did <- OTHER] [John
>> <- START] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. If it was "Did John
>> Smith eat the stew", you would have [Did <- OTHER] [John <- START]
>> [Smith <- IN] [eat <- OTHER] [the <- OTHER] [stew <- OTHER]. There are
>> features other than just the word, the previous word, and the shape (first
>> letter capitalized, all letters capitalized). I think the name finder
>> uses part of speech also.
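The START/IN/OTHER walk described above can be made concrete with a short sketch. It does no learning: it only shows how a known name span maps to per-token labels and what context (word, shape, previous word, previous label) would be available at each position. The class, method, and feature names are hypothetical.

```java
import java.util.Arrays;

class NameLabelSketch {

    // Label each token OTHER, START or IN, given name spans as
    // {startInclusive, endExclusive} token indexes.
    static String[] label(String[] tokens, int[][] nameSpans) {
        String[] labels = new String[tokens.length];
        Arrays.fill(labels, "OTHER");
        for (int[] span : nameSpans) {
            labels[span[0]] = "START";
            for (int i = span[0] + 1; i < span[1]; i++) {
                labels[i] = "IN";
            }
        }
        return labels;
    }

    // Context a classifier could see at position i: the word, its shape,
    // the previous word, and the label already chosen for the previous word.
    static String context(String[] tokens, String[] labels, int i) {
        String word = tokens[i];
        String shape = word.equals(word.toUpperCase()) ? "allCaps"
                : Character.isUpperCase(word.charAt(0)) ? "initCap" : "lower";
        String prevWord = i == 0 ? "<s>" : tokens[i - 1];
        String prevLabel = i == 0 ? "<s>" : labels[i - 1];
        return "w=" + word + " shape=" + shape + " pw=" + prevWord + " pl=" + prevLabel;
    }

    public static void main(String[] args) {
        String[] tokens = {"Did", "John", "Smith", "eat", "the", "stew"};
        String[] labels = label(tokens, new int[][] {{1, 3}});
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(context(tokens, labels, i) + " -> " + labels[i]);
        }
    }
}
```

Because the previous label is part of the context, the first token of every sequence gets the `<s>` marker; treating a whole resume as one sequence means that marker appears only once per document.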
>>
>>
>>  So you see that it is not a name lookup table, but dependent on the
>> previous classification of words earlier in the sentence.  Therefore, you
>> must have sentences. Does that help?
>> Daniel
>>
>> On Aug 25, 2016, at 9:55 AM, Damiano Porta <[email protected]> wrote:
>>
>> Hello everybody!
>>
>> Could someone explain why I should separate each sentence of my documents
>> to train my models?
>> My documents are resumes/CVs, and the sentences can be very different.
>> For example, a sentence could also be:
>>
>> 1. Name: John
>> 2. Surname: travolta
>>
>> etc.
>> So my question is: what is the problem if I train my models
>> (name finder, tokenizer) with the complete resume/CV, one per line?
>>
>> Could it be a problem?
>> In that case, when I want to tokenize the resume and do NER, I
>> will simply pass the complete resume text, skipping the "sentence
>> detection" process.
>>
>> Thanks for your opinion in advance!
>>
>> Best
>> Damiano
>>
>>
