Iseratho commented on pull request #1169:
URL: https://github.com/apache/systemds/pull/1169#issuecomment-770282063

## Update on the PR

The PR now contains a reference implementation for the tokenization API. It can be used for simple tokenization and is extensible with new algorithms.

It provides the following features:
- [x] 2 simple tokenization algorithms (i.e., whitespace and ngram)
- [x] 2 output representations (i.e., count and position)
- [x] algorithms are configurable with JSON spec

Notes on design considerations:
- [x] output representation is independent of the tokenizer algorithm
- [x] distributable function, as it does not create a dictionary for the tokens (tokens can be encoded with `transformencode`)
- [x] API similar to transform functions (e.g., using JSON spec)

There is still an open issue regarding correctly setting the `DataCharacteristics` for Spark execution.
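
To illustrate the design notes above (a tokenizer algorithm chosen independently of the output representation, both selected through a JSON spec, and no token dictionary built along the way), here is a minimal standalone sketch in Python. It is not the code from this PR and does not use the SystemDS API. The spec keys `algo`, `algo_params`, and `out`, all function names, and the choice of character n-grams per word are assumptions made only for this example.

```python
import json
from collections import Counter


def whitespace_tokens(text):
    """Split the text on runs of whitespace."""
    return text.split()


def ngram_tokens(text, min_gram=1, max_gram=2):
    """Character n-grams of each whitespace-separated word (an assumption)."""
    tokens = []
    for word in text.split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):
                tokens.append(word[i:i + n])
    return tokens


def to_count(doc_id, tokens):
    """One row per distinct token: (doc_id, token, count), sorted by token."""
    counts = Counter(tokens)
    return [(doc_id, tok, c) for tok, c in sorted(counts.items())]


def to_position(doc_id, tokens):
    """One row per token occurrence: (doc_id, 1-based position, token)."""
    return [(doc_id, pos, tok) for pos, tok in enumerate(tokens, start=1)]


# Hypothetical registries: any algorithm can be paired with any representation.
ALGORITHMS = {"whitespace": whitespace_tokens, "ngram": ngram_tokens}
REPRESENTATIONS = {"count": to_count, "position": to_position}


def tokenize(docs, spec_json):
    """Tokenize (doc_id, text) pairs according to a JSON spec string.

    Each document is processed on its own and tokens stay as strings,
    so no shared dictionary is required.
    """
    spec = json.loads(spec_json)
    algo = ALGORITHMS[spec["algo"]]
    params = spec.get("algo_params", {})
    represent = REPRESENTATIONS[spec["out"]]
    rows = []
    for doc_id, text in docs:
        rows.extend(represent(doc_id, algo(text, **params)))
    return rows


if __name__ == "__main__":
    docs = [(1, "a b a"), (2, "ab")]
    print(tokenize(docs, '{"algo": "whitespace", "out": "count"}'))
    print(tokenize(docs, '{"algo": "ngram", "out": "position"}'))
```

Because each document is handled independently and tokens are emitted as plain strings, the rows for different documents could be produced on different workers. Mapping tokens to integer IDs would be a separate, later step, which the comment above assigns to `transformencode`.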
