szha commented on issue #9514: Language Modeling Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#issuecomment-360668139
 
 
   To address the concern about merging datasets based on frequencies, I made the 
frequency counts a property of the dataset too. This way, the user has 
control over how the vocabulary is built.
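
   As a rough sketch of the idea (not the code in this PR): if each dataset 
exposes its token counts, a vocabulary can be built from any combination of 
datasets by summing the counters. The names `ToyTextDataset`, `frequencies`, 
`build_vocab`, and `min_freq` below are hypothetical and chosen only for 
illustration.

```python
from collections import Counter


class ToyTextDataset:
    """Hypothetical stand-in for a language modeling dataset (not the PR's class)."""

    def __init__(self, text):
        # Naive whitespace tokenization, for illustration only.
        self.tokens = text.split()
        # Token counts are kept on the dataset so callers can combine them.
        self.frequencies = Counter(self.tokens)


def build_vocab(datasets, min_freq=1):
    """Merge per-dataset counts and map each kept token to an integer index."""
    merged = Counter()
    for ds in datasets:
        merged.update(ds.frequencies)  # Counter.update adds counts together
    # Keep tokens seen at least min_freq times; order by count, then alphabetically.
    kept = sorted(
        (tok for tok, count in merged.items() if count >= min_freq),
        key=lambda tok: (-merged[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(kept)}


train = ToyTextDataset("the cat sat on the mat")
valid = ToyTextDataset("the dog sat")
vocab = build_vocab([train, valid], min_freq=2)
```

   Because the merge happens outside the dataset, the caller decides which 
splits contribute to the vocabulary and what cutoff to apply.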
   
   Currently the tokenization is naive, and the next step should be to add a 
proper tokenizer class. Once that's available, the datasets should expose an 
option for specifying a tokenizer.
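
   One possible shape for such an option, sketched under assumptions: the 
dataset takes a `tokenizer` argument and falls back to whitespace splitting. 
`ToyCorpus`, `whitespace_tokenizer`, and `regex_tokenizer` are hypothetical 
names, and plain callables stand in here for the tokenizer class that does not 
exist yet.

```python
import re
from collections import Counter


def whitespace_tokenizer(line):
    """Naive default: split on runs of whitespace."""
    return line.split()


def regex_tokenizer(line):
    """Lowercase the line and separate words from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line.lower())


class ToyCorpus:
    """Hypothetical dataset that accepts any callable mapping a line to tokens."""

    def __init__(self, lines, tokenizer=whitespace_tokenizer):
        self.tokens = [tok for line in lines for tok in tokenizer(line)]
        self.frequencies = Counter(self.tokens)


lines = ["Hello, world!", "hello again"]
naive = ToyCorpus(lines)
better = ToyCorpus(lines, tokenizer=regex_tokenizer)
```

   With the naive default, "Hello," and "hello" are counted as different 
tokens; with the regex tokenizer they collapse into one entry, which changes 
the frequency counts the vocabulary is built from.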

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
