Hello, I am creating a custom tokenizer. It works pretty well, but I have problems with emails. Emails can contain the characters _ - . which are split off as separate tokens in normal text, so my question is: how can I train the tokenizer better? After tokenization I need to apply different regexes to extract emails, dates, and telephone numbers, so such patterns must not be split into multiple tokens.
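To make the desired behaviour concrete, here is a minimal sketch in Python. It is not the trained tokenizer itself: it assumes a rule-based step in which protected patterns are tried before the generic word and punctuation rules. The patterns and the `tokenize` name are hypothetical and deliberately simplified.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# real emails and dates need broader rules.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
DATE = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
WORD = r"\w+"
PUNCT = r"[^\w\s]"

# Alternatives are tried left to right, so the protected patterns
# (email, date) win over the generic word/punctuation rules.
TOKEN_RE = re.compile("|".join([EMAIL, DATE, WORD, PUNCT]))

def tokenize(text):
    """Return tokens, keeping emails and dates as single tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Write to john_doe-1@mail.example.com before 12/05/2020.")
print(tokens)
```

With this ordering, the _ - . characters inside the email stay in one token, while the final period of the sentence is still split off as punctuation.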
Thanks, Damiano
