Iseratho edited a comment on pull request #1169: URL: https://github.com/apache/systemds/pull/1169#issuecomment-770282063
## Update on the PR

The PR now contains a reference implementation of the tokenization API. It can be used for simple tokenization and is extensible with new algorithms. It provides the following features:

- 2 simple tokenization algorithms (i.e., whitespace and ngram)
- 2 output representations (i.e., count and position)
- support for both long and wide format
- algorithms are configurable via a JSON spec

Notes on design considerations:

- the output representation is independent of the tokenizer algorithm
- the function is distributable, as it does not create a dictionary for the tokens (tokens can be encoded with `transformencode`)
- the API is similar to the transform functions (e.g., using a JSON spec); see the sketch below
- a maximum number of tokens must be specified so the DataCharacteristics can be set correctly for Spark execution
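As a rough illustration of the intended usage, here is a minimal DML sketch. The spec keys shown (`algo`, `out`, `format_wide`, `id_cols`, `tokenize_col`) and the exact `tokenize` parameter names are assumptions for illustration and may differ from what this PR finally implements.

```dml
# Read the raw text as a frame (e.g., CSV with an id column and a text column).
F = read("/tmp/docs.csv", data_type="frame", format="csv");

# JSON spec selecting the tokenizer algorithm and output representation.
# Key names are illustrative; "algo" could also be "ngram" and
# "out" could be "position", per the description above.
jspec = "{\"algo\": \"whitespace\", \"out\": \"count\", \"format_wide\": false, \"id_cols\": [1], \"tokenize_col\": 2}";

# A maximum number of tokens is required so that the output
# DataCharacteristics can be set up front for Spark execution.
T = tokenize(target=F, spec=jspec, max_tokens=1024);

# The resulting tokens are still strings; they can subsequently be
# mapped to numeric codes with transformencode.
```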
