Have you trained it with enough examples of email addresses?
Some tools have a sequence validator, but I don't think the tokenizer has
one. If it did, you could write one that recognizes these patterns.
Another option would be to customize the feature generator to add a special
feature when the token looks like an email address or telephone number.
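
To illustrate the feature idea, here is a rough, toolkit-independent sketch
in Python. Everything in it is an assumption for illustration: the function
name shape_features, the feature names, and both regular expressions are
hypothetical, and how the result gets wired into the tokenizer's feature
generation is not shown. The point is only that the whole
whitespace-delimited chunk is tested, so every candidate split inside it can
carry a "do not split here" hint.

```python
import re

# Hypothetical, deliberately loose patterns; tune them for your own data.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^\+?\d[\d\s().-]{5,}\d$")


def shape_features(chunk):
    """Return extra feature strings for one whitespace-delimited chunk."""
    features = []
    if EMAIL_RE.match(chunk):
        features.append("looks_like_email")
    if PHONE_RE.match(chunk):
        features.append("looks_like_phone")
    return features
```

The returned strings would be added to the features already produced for
each candidate split position in the chunk. The phone pattern is loose
enough that some dates will match it too, so it may need tightening.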


Regards
William


On Monday, August 29, 2016, Damiano Porta <
[email protected]> wrote:

> Hello,
> I am creating a custom tokenizer. It works pretty well, but I have problems
> with emails.
> Emails can contain _ - . characters, which are split in normal text, so
> the question is, how can I train it better?
> After tokenization I need to apply different regexes to extract
> emails/dates/telephone numbers, so I must not tokenize such patterns.
>
> Thanks
> Damiano
>


-- 
William Colen
