Hello everybody, I have created a custom tokenizer that does not split specific "patterns" such as emails, telephone numbers, dates, etc. Each match is converted into one single token. The rest of the text is tokenized with the SimpleTokenizer.
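To make the idea concrete, here is a minimal sketch of the logic in Python. It is not my actual OpenNLP code: the two patterns, the names PROTECTED, FALLBACK and tokenize, and the fallback regex are hypothetical, and the fallback only roughly approximates what SimpleTokenizer does (splitting on character class).

```python
import re

# Hypothetical "protected" patterns, for illustration only.
PROTECTED = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # email address
    r"|\d{4} \d{2} \d{2}"             # date written as "2017 03 02"
)

# Rough stand-in for a character-class tokenizer:
# runs of letters, runs of digits, or a single other non-space character.
FALLBACK = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text):
    """Keep protected patterns as single tokens, split everything else."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Tokenize the ordinary text before the protected span.
        tokens.extend(FALLBACK.findall(text[pos:match.start()]))
        # Emit the protected span unchanged, as one token.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last protected span.
    tokens.extend(FALLBACK.findall(text[pos:]))
    return tokens


if __name__ == "__main__":
    print(tokenize("Meeting on 2017 03 02, mail bob@example.com now."))
```

With this approach the date and the email each come out as one token, while the surrounding words and punctuation are split as usual.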
The problem arises when I need to train a NER model. For example, if my data contains dates like 2017 03 02, the whitespace tokenizer splits each of them into three tokens, and I must avoid that. Can I specify which tokenizer to use with the TokenNameFinderTrainer tool?

Thanks,
Damiano