Iseratho opened a new pull request #1169:
URL: https://github.com/apache/systemds/pull/1169


   This PR contains an initial implementation of a text tokenizer that serves 
as a design proposal. 
   
   ## Why does SystemDS benefit from a built-in tokenizer feature?
   
   A built-in tokenizer is an important feature to allow end-to-end machine 
learning. At the moment SystemDS expects textual data to be provided in a 
tokenized form. For instance, one test case reads the 20 newsgroup dataset in 
bag-of-words (BoW) format. 
   Currently, SystemDS supports transformations for frame-to-matrix and vice 
versa, as well as row-wise transformations via `map()`. However, textual data 
can be of arbitrary length, so tokenizing a single input row can yield any 
number of tokens. Thus, it does not fit the already built-in row-wise 
(i.e., one-to-one) mapping.
   
   ## Proposed Tokenize API
   
   Our proposed tokenizer API is tailored towards SystemDS and provides a 
`tokenize()` function that enables frame-to-frame transformation using a JSON 
specification similar to the transform function. We have included an initial 
draft implementation (a minimalistic whitespace tokenizer for BoW format), which 
shows how we would incorporate the tokenization feature and showcases it with 
test cases.
   Similar to the transform functions, this should be extendable with new 
algorithms, each with its own configuration options, and executable in a 
distributed way (Spark/Hybrid execution). 
   The details are described in the [DML 
reference](docs/site/dml-language-reference.md).
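
To make the idea concrete for discussion, here is a minimal sketch in plain Python (not DML, and not the code in this PR) of what a whitespace tokenizer producing BoW output could look like. The function name `tokenize_bow` and the `(id, token, count)` output layout are assumptions for illustration only; the actual output schema is the one described in the DML reference.

```python
from collections import Counter

def tokenize_bow(rows):
    """Hypothetical helper: whitespace-tokenize (doc_id, text) rows.

    Returns one (doc_id, token, count) tuple per distinct token of each
    document, so a single input row may produce many output rows.
    """
    out = []
    for doc_id, text in rows:
        # str.split() without arguments splits on any run of whitespace.
        counts = Counter(text.split())
        # Sort tokens so the output order is deterministic.
        for token, count in sorted(counts.items()):
            out.append((doc_id, token, count))
    return out

docs = [(1, "the quick brown fox the end"), (2, "hello world")]
bow = tokenize_bow(docs)
for row in bow:
    print(row)
```

The point of the sketch is the shape of the result: the number of output rows depends on the text, which is why a one-to-one `map()` is not sufficient.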
   
   ## State of PR
   
   This PR is not to be merged at the moment, but should provide a way to 
discuss and improve the design considerations. We plan to integrate more 
tokenizers and improve the options to configure the algorithms, as well as 
showcase the usage of the tokenizer by creating a standard natural language 
pipeline (e.g., preprocess, tokenize, classify).
   
   ## Authors of PR
   
   This design proposal was jointly authored by @Iseratho, @skogler, and 
@davidfroehlich.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

