OK, yes, that should be a good solution!

So, do you think it is better to keep "call me at + 39 06 12 23 45 56" as it
is (the telephone number has 7 tokens) and add a custom feature to each token
so that the classifier learns it is part of the telephone number?
I did it during tokenization because I am parsing very messy documents, so
there are many telephone formats (and many digit separators too: . - / | \s).
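
To make the per-token feature idea concrete, here is a minimal sketch in plain
Java. It does not use the OpenNLP API: the class name PhoneTokenFeatures, the
PHONE pattern and the "inphone" feature string are all hypothetical, and in a
real setup the same logic would presumably live inside a custom feature
generator as Daniel suggests. It splits on whitespace, finds phone-like spans
with a regex, and tags every token that overlaps such a span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneTokenFeatures {

    // Hypothetical pattern: an optional "+", then at least six digits that may
    // be separated by whitespace or . - / | characters. It will also match
    // other digit runs (e.g. dates), so it needs tuning for real data.
    static final Pattern PHONE =
        Pattern.compile("(?:\\+\\s*)?\\d(?:[\\s./|-]*\\d){5,}");

    // Whitespace tokenization: every run of non-whitespace is one token.
    static final Pattern TOKEN = Pattern.compile("\\S+");

    /** Returns one feature list per whitespace-separated token. */
    static List<List<String>> features(String sentence) {
        // Character spans [start, end) of every phone-like match.
        List<int[]> phoneSpans = new ArrayList<>();
        Matcher pm = PHONE.matcher(sentence);
        while (pm.find()) {
            phoneSpans.add(new int[] {pm.start(), pm.end()});
        }

        List<List<String>> result = new ArrayList<>();
        Matcher tm = TOKEN.matcher(sentence);
        while (tm.find()) {
            List<String> feats = new ArrayList<>();
            feats.add("w=" + tm.group().toLowerCase(Locale.ROOT));
            // Tag the token if its span overlaps any phone-like span.
            for (int[] span : phoneSpans) {
                if (tm.start() < span[1] && tm.end() > span[0]) {
                    feats.add("inphone");
                    break;
                }
            }
            result.add(feats);
        }
        return result;
    }
}
```

Calling PhoneTokenFeatures.features("call me at + 39 06 12 23 45 56") tags the
seven tokens from "+" to "56" with "inphone" and leaves "call", "me" and "at"
untagged, so the tokenizer itself does not have to change.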

2017-03-02 18:38 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>:

> Damiano,
>
>     I am not an expert on the NameFinder, but I don’t think you want to
> use a custom tokenizer.  You might consider using a custom feature
> generator.  I know there is an XML definition.  I might create an
> additional feature generator that looks for your regex patterns and adds a
> set of features to the feature list.  The nice thing about the classifier
> is that you will catch things like “call me at 3011234567.” even though
> your regex won’t match (if you look at the previous 4 words to catch “call
> me”).
>
>
> Daniel
>
> On 3/2/17, 12:24 PM, "Damiano Porta" <damianopo...@gmail.com> wrote:
>
>     Hello Daniel, yes exactly, I do that. I am using regexes to find those
>     patterns.
>     Daniel, is this problem only related to the TokenNameFinderTrainer tool?
>     If I train it via code, should I use a custom tokenizer?
>     If not, I will follow your solution using underscores.
>
>     Thanks
>     Damiano
>
>     2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>:
>
>     > Hi Damiano,
>     >    In general this is a difficult problem (making n-grams from
>     > unigrams). Have you considered using a regex to find your dates,
>     > phone numbers, etc. and protecting them from the tokenizer (i.e.
>     > replacing the whitespace with a printable, though possibly
>     > non-alphanumeric, character such as an underscore)?
>     > Daniel
>     >
>     > On 3/2/17, 11:46 AM, "Damiano Porta" <damianopo...@gmail.com> wrote:
>     >
>     >     Hello everybody,
>     >
>     >     I have created a custom tokenizer that does not split specific
>     >     "patterns" such as emails, telephone numbers, dates, etc. I
>     >     convert each of them into ONE single token. The other parts of
>     >     the text are tokenized with the SimpleTokenizer.
>     >
>     >     The problem arises when I need to train a NER model. For example,
>     >     if my data has dates like 2017 03 02, these will be converted into
>     >     three tokens (whitespace tokenizer); I must avoid that.
>     >
>     >     Can I specify the tokenizer using the TokenNameFinderTrainer tool?
>     >
>     >     Thanks
>     >     Damiano
>     >
>     >
>     >
>
>
>
