Tokenizer for NER training

Damiano Porta Thu, 02 Mar 2017 08:46:37 -0800

Hello everybody,

I have created a custom tokenizer that does not split specific "patterns"
such as emails, telephone numbers, dates, etc. It converts each of them into
ONE single token. The other parts of the text are tokenized with the
SimpleTokenizer.
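
A minimal sketch of this kind of tokenizer could look like the following. It is
an illustration only: the class name, the two regular expressions (emails and
dates written as "yyyy MM dd") and the helper method are assumptions, and a
simple character-class splitter stands in for SimpleTokenizer so that the
example has no library dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not part of OpenNLP.
final class PatternAwareTokenizer {

    // Assumed "protected" patterns: e-mail addresses, and dates written as "yyyy MM dd".
    private static final Pattern PROTECTED = Pattern.compile(
            "[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+"
            + "|\\b\\d{4} \\d{2} \\d{2}\\b");

    // Rough stand-in for SimpleTokenizer: a run of letters, a run of digits,
    // or one single character that is neither a letter, a digit, nor whitespace.
    private static final Pattern SIMPLE =
            Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    // Keeps every protected match as one token and splits the text in between.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            addSimpleTokens(text.substring(last, m.start()), tokens);
            tokens.add(m.group());
            last = m.end();
        }
        addSimpleTokens(text.substring(last), tokens);
        return tokens;
    }

    // Appends the character-class tokens of an unprotected chunk to the output list.
    private static void addSimpleTokens(String chunk, List<String> out) {
        Matcher m = SIMPLE.matcher(chunk);
        while (m.find()) {
            out.add(m.group());
        }
    }
}
```

With these assumed patterns, PatternAwareTokenizer.tokenize("Mail john@example.com on 2017 03 02.")
keeps the address and the date as single tokens, while the remaining text is
split into words and punctuation.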


The problem arises when I need to train a NER model. For example, if my data
has dates like 2017 03 02, these will be converted into three tokens
(whitespace tokenizer). I must avoid that.
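
The splitting can be reproduced with a short sketch. The class and method names
here are hypothetical, and plain String.split on whitespace is used as an
assumed approximation of what a whitespace tokenizer does to a line of text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example class, not part of OpenNLP.
final class WhitespaceSplitDemo {

    // Splits a line on runs of whitespace, so a date written with spaces
    // ends up as several separate tokens.
    static List<String> whitespaceTokens(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }
}
```

Calling WhitespaceSplitDemo.whitespaceTokens("Meeting on 2017 03 02 .") returns
the date as the three separate tokens "2017", "03" and "02", which is the
behaviour that has to be avoided.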

Can I specify the tokenizer when using the TokenNameFinderTrainer tool?

Thanks
Damiano
