[GitHub] [systemds] Iseratho commented on pull request #1169: [WIP] Tokenizer Reference Implementation

GitBox Sat, 30 Jan 2021 13:16:05 -0800


Iseratho commented on pull request #1169:
URL: https://github.com/apache/systemds/pull/1169#issuecomment-770282063



   ##  Update on the PR
   The PR now contains a reference implementation for the tokenization API. It 
can be used for simple tokenization and is extensible with new algorithms.
   
   It provides the following features:
   - [x] 2 simple tokenization algorithms (i.e., whitespace and ngram)
   - [x] 2 output representations (i.e., count and position)
   - [x] algorithms are configurable with a JSON spec
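
To illustrate how these three features could fit together, here is a minimal Python sketch. It is not the code from this PR: the spec keys (`algo`, `out`, `algo_params`), the function names, the output tuple layouts, and the reading of "ngram" as character n-grams within each whitespace-separated word are all assumptions made for illustration.

```python
import json
from collections import Counter

def whitespace_tokens(text):
    # Split on runs of whitespace.
    return text.split()

def ngram_tokens(text, min_n=1, max_n=2):
    # Assumption: character n-grams inside each whitespace-separated word.
    tokens = []
    for word in text.split():
        for n in range(min_n, max_n + 1):
            for i in range(len(word) - n + 1):
                tokens.append(word[i:i + n])
    return tokens

def as_count(doc_id, tokens):
    # One (doc_id, token, count) row per distinct token, in first-seen order.
    return [(doc_id, tok, c) for tok, c in Counter(tokens).items()]

def as_position(doc_id, tokens):
    # One (doc_id, position, token) row per token, positions starting at 1.
    return [(doc_id, pos, tok) for pos, tok in enumerate(tokens, start=1)]

# Hypothetical registries: algorithm and output representation are looked up separately.
ALGORITHMS = {"whitespace": whitespace_tokens, "ngram": ngram_tokens}
OUTPUTS = {"count": as_count, "position": as_position}

def tokenize(doc_id, text, spec_json):
    # The spec layout here is hypothetical, not the PR's actual schema.
    spec = json.loads(spec_json)
    algo = ALGORITHMS[spec["algo"]]
    out = OUTPUTS[spec["out"]]
    tokens = algo(text, **spec.get("algo_params", {}))
    return out(doc_id, tokens)

counts = tokenize(1, "a b a", '{"algo": "whitespace", "out": "count"}')
positions = tokenize(
    1, "abc",
    '{"algo": "ngram", "out": "position", "algo_params": {"min_n": 1, "max_n": 2}}',
)
```

Because the algorithm and the output representation are selected from separate tables, adding a new algorithm in this sketch only means adding one entry to `ALGORITHMS`.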
   
   Notes on design considerations:
   - [x] output representation is independent of the tokenizer algorithm
   - [x] the function is distributable because it does not build a dictionary 
for the tokens (tokens can be encoded afterwards with `transformencode`)
   - [x] API similar to transform functions (e.g., using JSON spec)
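
The distributability point can be sketched in Python as follows. This is an illustration under assumptions, not the PR's implementation: `tokenize_rows` and `build_dictionary` are hypothetical helpers, and `build_dictionary` is only a stand-in for a later encoding pass, not a model of how `transformencode` actually works.

```python
def tokenize_rows(rows):
    # rows: list of (doc_id, text). Each row is handled on its own,
    # so no state is shared across rows or partitions.
    out = []
    for doc_id, text in rows:
        for pos, tok in enumerate(text.split(), start=1):
            out.append((doc_id, pos, tok))
    return out

def build_dictionary(triples):
    # Global step that needs all tokens; kept separate from tokenization.
    distinct = sorted({tok for _, _, tok in triples})
    return {tok: code for code, tok in enumerate(distinct, start=1)}

rows = [(1, "to be or"), (2, "not to be")]

# Tokenizing everything at once gives the same rows as tokenizing
# two partitions independently and concatenating the results.
whole = tokenize_rows(rows)
parts = tokenize_rows(rows[:1]) + tokenize_rows(rows[1:])

# Encoding happens afterwards, as its own pass over the token output.
dictionary = build_dictionary(whole)
encoded = [(doc_id, pos, dictionary[tok]) for doc_id, pos, tok in whole]
```

Since tokenization never consults a shared dictionary, partitions need no coordination; only the later encoding step requires a global view of the tokens.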
   
   There is still an open issue regarding correctly setting the 
DataCharacteristics for Spark execution.


