OK, yes, that should be a good solution! So, do you think it is better to have "call me at + 39 06 12 23 45 56" (the telephone number is 7 tokens) and add a custom feature to each token so the classifier learns it as part of the telephone number? I did it during tokenization because I am parsing very messy documents, so the telephone formats vary a lot (the digit-group separators do too: . - / | and whitespace).
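As a rough illustration of the per-token feature idea, here is a minimal standalone sketch (class and feature names are mine, not from OpenNLP). In a real project the class would implement `opennlp.tools.util.featuregen.AdaptiveFeatureGenerator` and be wired in via the feature-generator descriptor; the method below mirrors that interface's `createFeatures` shape, with `previousOutcomes` omitted for brevity.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: marks tokens that look like fragments of a phone
// number so the classifier can learn to group them into one entity.
public class PhoneRegexFeatureGenerator {

    // A token that is a short digit group (optionally with leading '+')
    // or one of the common separators seen in messy documents.
    private static final Pattern PHONE_PART =
        Pattern.compile("\\+?\\d{1,4}|[-./|\\\\]");

    /** Adds a "phonepart" feature for the token at {@code index}. */
    public void createFeatures(List<String> features, String[] tokens, int index) {
        if (PHONE_PART.matcher(tokens[index]).matches()) {
            features.add("phonepart");
            // Context feature: the previous token is also a phone fragment,
            // which helps the model chain fragments into one number.
            if (index > 0 && PHONE_PART.matcher(tokens[index - 1]).matches()) {
                features.add("phonepart,prev=phonepart");
            }
        }
    }
}
```

With features like these, the whitespace tokenization can stay as-is: each of the 7 tokens carries evidence that it belongs to a phone number, instead of the tokenizer having to glue them together up front.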
2017-03-02 18:38 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>:

> Damiano,
>
> I am not an expert on the NameFinder, but I don't think you want to use a
> custom tokenizer. You might consider using a custom feature generator. I
> know there is an XML definition. I might create an additional feature
> generator that looks for your regex patterns and adds a set of features
> to the feature list. The nice thing about the classifier is that you will
> catch things like "call me at 3011234567." even though your regex won't
> match (if you look at the previous 4 words to catch "call me").
>
> Daniel
>
> On 3/2/17, 12:24 PM, "Damiano Porta" <damianopo...@gmail.com> wrote:
>
> Hello Daniel, yes exactly, I do that. I am using regexes to find those
> patterns.
> Daniel, is this problem only related to the TokenNameFinderTrainer tool?
> If I train it via code, should I use a custom tokenizer?
> If not, I will follow your solution using underscores.
>
> Thanks
> Damiano
>
> 2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>:
>
> > Hi Damiano,
> > In general this is a difficult problem (making n-grams from unigrams).
> > Have you considered using RegEx to find your dates/phone numbers etc.
> > and protecting them from the tokenizer (i.e. replacing the white space
> > with a printable, though possibly not alphanumeric, character like an
> > underscore)?
> > Daniel
> >
> > On 3/2/17, 11:46 AM, "Damiano Porta" <damianopo...@gmail.com> wrote:
> >
> > Hello everybody,
> >
> > I have created a custom tokenizer that does not split specific
> > "patterns" like emails, telephones, dates, etc. I convert them into ONE
> > single token. The other parts of the text are tokenized with the
> > SimpleTokenizer.
> >
> > The problem is when I need to train a NER model. For example, if my
> > data has dates like 2017 03 02, these will be converted into three
> > tokens (whitespace tokenizer); I must avoid that.
> >
> > Can I specify the tokenizer using the TokenNameFinderTrainer tool?
> >
> > Thanks
> > Damiano
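For reference, Daniel's underscore fallback from the earlier message could be sketched like this (the class name and the loose phone pattern are mine; tune the regex to your documents' formats). It finds phone-like spans first and replaces their internal spaces with underscores, so a plain whitespace tokenizer keeps each number as a single token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "protect from the tokenizer" approach:
// pre-process the text so matched numbers survive whitespace splitting.
public class PhoneProtector {

    // Loose pattern for numbers written as spaced digit groups,
    // e.g. "+39 06 12 23 45 56"; adjust for your separator variants.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d{1,4}(?:[ \\-./]\\d{2,4}){2,}");

    public static String protect(String text) {
        Matcher m = PHONE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace spaces inside the matched span only.
            String glued = m.group().replace(' ', '_');
            m.appendReplacement(sb, Matcher.quoteReplacement(glued));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

The trade-off versus the feature-generator route is that anything the regex misses (like "3011234567" with no separators at all) gets no protection, whereas the classifier can still pick such cases up from context features.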