Hello everybody,

I have created a custom tokenizer that does not split specific "patterns"
such as emails, telephone numbers, dates, etc. Each of them is converted into
ONE single token. The rest of the text is tokenized with the
SimpleTokenizer.
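
To make the idea concrete, here is a minimal sketch of that kind of tokenizer in plain Java. It is only an illustration under assumptions: the class name PatternPreservingTokenizer, the two regular expressions (a simplified email and a "yyyy mm dd" date) and the fallback splitter are all hypothetical, and the fallback only roughly approximates what SimpleTokenizer does (runs of letters, runs of digits, single other characters). It does not use the OpenNLP API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the actual tokenizer and not an OpenNLP class.
final class PatternPreservingTokenizer {

    // Assumed "protected" patterns that must stay in one token.
    private static final Pattern PROTECTED = Pattern.compile(
        "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"   // simplified email
        + "|\\b\\d{4} \\d{2} \\d{2}\\b");    // date written as 2017 03 02

    // Rough stand-in for character-class splitting: a run of letters,
    // a run of digits, or any single other non-whitespace character.
    private static final Pattern SIMPLE =
        Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            // Text before the protected match goes through the fallback splitter.
            splitSimple(text.substring(last, m.start()), tokens);
            // The protected match itself is kept as one token, spaces included.
            tokens.add(m.group());
            last = m.end();
        }
        splitSimple(text.substring(last), tokens);
        return tokens;
    }

    private static void splitSimple(String chunk, List<String> out) {
        Matcher m = SIMPLE.matcher(chunk);
        while (m.find()) {
            out.add(m.group());
        }
    }
}
```

With these assumed patterns, tokenize("Mail a.b@example.com on 2017 03 02.") yields five tokens: Mail, a.b@example.com, on, 2017 03 02 and the final period.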

The problem arises when I need to train a NER model. For example, if my data
contains dates like 2017 03 02, these will be split into three tokens
(whitespace tokenizer), and I must avoid that.

Can I specify the tokenizer when using the TokenNameFinderTrainer tool?

Thanks
Damiano
