Iseratho edited a comment on pull request #1169: URL: https://github.com/apache/systemds/pull/1169#issuecomment-770282063
## Update on the PR

The PR now contains a reference implementation of the tokenization API. It can be used for simple tokenization and is extensible with new algorithms. It provides the following features:

- 2 simple tokenization algorithms (i.e., whitespace and ngram)
- 2 output representations (i.e., count and position)
- support for both long and wide format
- algorithms are configurable via a JSON spec

Notes on design considerations:

- the output representation is independent of the tokenizer algorithm
- the function is distributable, as it does not create a dictionary for the tokens (tokens can be encoded with `transformencode`)
- the API is similar to the transform functions (e.g., using a JSON spec); see the sketch below
- a maximum number of tokens must be specified so the DataCharacteristics can be set correctly for Spark execution
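As a rough illustration of the intended usage, here is a minimal DML sketch. The spec keys shown (`algo`, `out`, `format_wide`, `id_cols`, `tokenize_col`) and the exact `tokenize` parameter names are assumptions for illustration and may differ from what this PR finally implements.

```dml
# Read the raw text as a frame (e.g., CSV with an id column and a text column).
F = read("/tmp/docs.csv", data_type="frame", format="csv");

# JSON spec selecting the tokenizer algorithm and output representation.
# Key names are illustrative; "algo" could also be "ngram" and
# "out" could be "position", per the description above.
jspec = "{\"algo\": \"whitespace\", \"out\": \"count\", \"format_wide\": false, \"id_cols\": [1], \"tokenize_col\": 2}";

# A maximum number of tokens is required so that the output
# DataCharacteristics can be set up front for Spark execution.
T = tokenize(target=F, spec=jspec, max_tokens=1024);

# The resulting tokens are still strings; they can subsequently be
# mapped to numeric codes with transformencode.
```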
