Damiano,

    I am not an expert on the NameFinder, but I don’t think you want to use a 
custom tokenizer.  You might consider using a custom feature generator.  I know 
there is an XML definition.  I might create an additional feature generator that 
looks for your regex patterns and adds a set of features to the feature list.   
The nice thing about the classifier is that it will catch things like “call me 
at 3011234567.” even when your regex alone won’t match (if you look at the 
previous 4 words to catch “call me”).  
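A minimal sketch of that idea in plain Java (a real OpenNLP feature generator would implement opennlp.tools.util.featuregen.AdaptiveFeatureGenerator and add to the feature list it is given; the pattern, feature names, and trigger word below are hypothetical, just to show the regex match plus look-back window):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PhoneFeatureSketch {
    // Hypothetical pattern: a bare 10-digit US phone number
    private static final Pattern PHONE = Pattern.compile("\\d{10}");

    // Emit features for the token at `index`, looking back up to
    // 4 tokens for a trigger word like "call"
    public static List<String> createFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        if (PHONE.matcher(tokens[index]).matches()) {
            features.add("phone_regex=true");
            for (int i = Math.max(0, index - 4); i < index; i++) {
                if (tokens[i].equalsIgnoreCase("call")) {
                    features.add("call_context=true");
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "me", "at", "3011234567", "."};
        System.out.println(createFeatures(tokens, 3));
    }
}
```

The classifier then weighs these features alongside the usual token/context features, so a number the regex matches in a “call me …” context gets a strong push toward the PHONE label without any custom tokenization.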


Daniel

On 3/2/17, 12:24 PM, "Damiano Porta" <damianopo...@gmail.com> wrote:

    Hello Daniel, yes exactly, I do that. I am using regexes to find those
    patterns.
    Daniel, is this problem only related to the TokenNameFinderTrainer tool? If I
    train it via code, should I use a custom tokenizer?
    If not, I will follow your solution using underscores.
    
    Thanks
    Damiano
    
    2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>:
    
    > Hi Damiano,
    >    In general this is a difficult problem (making n-grams from unigrams).
    > Have you considered using RegEx to find your dates/phone numbers etc. and
    > protecting them from the tokenizer (i.e. replacing the whitespace with a
    > printable, though possibly non-alphanumeric, character such as an
    > underscore)?
    > Daniel
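The underscore-protection idea quoted above can be sketched like this (the date pattern is hypothetical; the point is rewriting matches so a whitespace tokenizer keeps them as one token):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenProtector {
    // Hypothetical date pattern: yyyy mm dd separated by single spaces
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{4}) (\\d{2}) (\\d{2})\\b");

    // Replace the inner spaces of each match with underscores so that
    // whitespace tokenization yields a single token per date
    public static String protect(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, m.group().replace(' ', '_'));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(protect("born on 2017 03 02 in Rome"));
        // prints: born on 2017_03_02 in Rome
    }
}
```

Run this over the training data (and at prediction time) before tokenization; the underscores can be mapped back to spaces after the NameFinder has labeled the spans.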
    >
    > On 3/2/17, 11:46 AM, "Damiano Porta" <damianopo...@gmail.com> wrote:
    >
    >     Hello everybody,
    >
    >     I have created a custom tokenizer that does not split specific
    >     "patterns" like emails, telephone numbers, dates, etc.; I convert
    >     them into ONE single token. The other parts of the text are
    >     tokenized with the SimpleTokenizer.
    >
    >     The problem is when I need to train a NER model. For example, if my
    >     data has dates like 2017 03 02, they will be converted into three
    >     tokens (whitespace tokenizer); I must avoid that.
    >
    >     Can I specify the tokenizer using the TokenNameFinderTrainer tool?
    >
    >     Thanks
    >     Damiano
    >
    >
    >
    
