Have you trained with enough examples of emails? Some tools have a sequence validator, but I don't think the tokenizer has one. If it did, you could write one that recognizes these patterns. Another option would be to customize the feature generator to add a special feature when the token looks like an email address or telephone number.
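To make the second option concrete, here is a minimal sketch of the kind of check a customized feature generator could run. It does not use any tokenizer library API: the class `PatternFeatures`, its methods `chunkAt` and `extraFeatures`, and the feature strings `chunk=email` and `chunk=phone` are all hypothetical names chosen for illustration, and the regexes are deliberately loose assumptions (the phone pattern will also match things like numeric dates). The idea is that, for a candidate split position, you look at the whole whitespace-delimited chunk around it and emit an extra feature when that chunk looks like an email or a phone number, so the model can learn not to split there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library. It computes extra features
// that a custom feature generator could append to its default ones.
final class PatternFeatures {

    // Loose, illustrative patterns; tune them to your own data.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");
    private static final Pattern PHONE =
            Pattern.compile("\\+?[0-9][0-9().-]{5,}[0-9]");

    private PatternFeatures() {}

    // Returns the whitespace-delimited chunk of the sentence that contains
    // the given index (assumes 0 <= index <= sentence.length()).
    static String chunkAt(String sentence, int index) {
        int start = index;
        while (start > 0 && !Character.isWhitespace(sentence.charAt(start - 1))) {
            start--;
        }
        int end = index;
        while (end < sentence.length() && !Character.isWhitespace(sentence.charAt(end))) {
            end++;
        }
        return sentence.substring(start, end);
    }

    // Returns the extra features for a candidate split position.
    static List<String> extraFeatures(String sentence, int index) {
        List<String> features = new ArrayList<>();
        // Ignore sentence punctuation glued to the end of the chunk.
        String chunk = chunkAt(sentence, index).replaceAll("[.,;:!?]+$", "");
        if (EMAIL.matcher(chunk).matches()) {
            features.add("chunk=email");
        }
        if (PHONE.matcher(chunk).matches()) {
            features.add("chunk=phone");
        }
        return features;
    }
}
```

You would call `extraFeatures(sentence, index)` from your own feature generator and add the result to the features it already produces. The extra feature only helps if the training data contains enough emails and phone numbers annotated as single tokens.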
Regards,
William

On Monday, August 29, 2016, Damiano Porta < [email protected]> wrote:

> Hello,
> I am creating a custom tokenizer. It works pretty well but I have problems
> with emails.
> The emails can have _ - . that are tokenized in normal text, so the
> question is, how can I train it better?
> After the tokenization I need to apply different regexes to extract
> email/dates/telephones so I must not tokenize such patterns.
>
> Thanks
> Damiano
>

--
William Colen
