Iseratho opened a new pull request #1169: URL: https://github.com/apache/systemds/pull/1169
This PR contains an initial implementation of a text tokenizer that serves as a design proposal.

## Why does SystemDS benefit from a built-in tokenizer feature?

A built-in tokenizer is an important feature for enabling end-to-end machine learning. At the moment, SystemDS expects textual data to be provided in tokenized form. For instance, one test case reads the 20 newsgroup dataset in bag-of-words (BoW) format. Currently, SystemDS supports frame-to-matrix transformations and vice versa, as well as row-wise transformations via `map()`. However, textual data can be of arbitrary length, so a single input row can produce any number of tokens. Tokenization therefore does not fit the existing row-wise (i.e., one-to-one) mapping.

## Proposed Tokenize API

Our proposed tokenizer API is tailored to SystemDS and provides a `tokenize()` function that enables frame-to-frame transformation using a JSON specification, similar to the transform functions. We have included an initial draft implementation (a minimalistic whitespace tokenizer for BoW format), which shows how we would incorporate the tokenization feature and showcases it with test cases. As with the transform functions, the tokenizer should be extensible with new algorithms, each with its own configuration options, and executable in a distributed way (Spark/Hybrid execution). The details are described in the [DML reference](docs/site/dml-language-reference.md).

## State of PR

This PR is not to be merged at the moment; it is meant as a basis for discussing and improving the design. We plan to integrate more tokenizers and improve the options for configuring the algorithms, as well as showcase the usage of the tokenizer by creating a standard natural language pipeline (e.g., preprocess, tokenize, classify).

## Authors of PR

This design proposal was jointly authored by @Iseratho, @skogler, and @davidfroehlich.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
