ilovemesomeramen opened a new pull request, #1639:
URL: https://github.com/apache/systemds/pull/1639

   This PR extends the existing tokenization to work with multi-threading, and additionally adds some minor features and bug fixes.
   MINOR bug fix: removed hardcoded thread counts in MultiColumnEncoder.
   
   Tokenization was moved into the `build` and `apply` paradigm that was introduced for `transformencode`, and the multithreading implementation follows the same structure as `transformencode`. In the `build` stage, the input is split into tokens and stored in an internal representation; additional metadata needed in the `apply` phase is computed as well. During `apply`, the computed data is retrieved and written to the output. The current implementation splits the input frame into row partitions; the default number of partitions is 64, which can be changed with the `sysds.parallel.tokenize.numBlocks` configuration.
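   A minimal, hypothetical sketch of the row-partitioned build/apply flow (the class and method names are illustrative, not the actual SystemDS API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch: split rows into blocks, build the token
// representations in parallel, then apply them to the output in order.
public class TokenizeSketch {
    public static void tokenize(List<String> rows, int numBlocks, int k) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(k);
        int blockSize = (rows.size() + numBlocks - 1) / numBlocks;
        List<Future<List<String[]>>> builds = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            int lo = b * blockSize, hi = Math.min(rows.size(), lo + blockSize);
            if (lo >= hi)
                break;
            builds.add(pool.submit(() -> buildBlock(rows.subList(lo, hi))));
        }
        for (Future<List<String[]>> f : builds)
            applyBlock(f.get()); // apply: write the computed tokens to the output
        pool.shutdown();
    }

    // build: split each row into tokens (internal representation)
    private static List<String[]> buildBlock(List<String> rows) {
        List<String[]> out = new ArrayList<>();
        for (String r : rows)
            out.add(r.split("\\s+"));
        return out;
    }

    // apply: here we just print; SystemDS writes into the output frame
    private static void applyBlock(List<String[]> tokens) {
        for (String[] t : tokens)
            System.out.println(String.join(" | ", t));
    }
}
```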
   This is a first implementation; known issues are:
   (1) Memory consumption: the execution DAG is not yet well optimized for memory consumption. This could be fixed in the future by computing subsets one after another (only possible when padding is enabled).
   (2) Cache performance: similar to (1), performance could be increased by computing subsets first in a cache-aware manner (unrolling loops).
   There is still some redundant code in the `TokenizerApplier` classes, although this is not easy to clean up without a major refactor of the previous implementation.
   Multithreading is currently disabled by default and can be activated via the `sysds.parallel.tokenize` configuration.
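   For illustration, the two settings would go into the SystemDS configuration file; this is a hedged sketch assuming the usual `SystemDS-config.xml` layout:

```xml
<root>
    <!-- enable multi-threaded tokenization (disabled by default) -->
    <sysds.parallel.tokenize>true</sysds.parallel.tokenize>
    <!-- number of row partitions used for tokenization (default: 64) -->
    <sysds.parallel.tokenize.numBlocks>64</sysds.parallel.tokenize.numBlocks>
</root>
```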
   
   Introduced `ngram_type` as a configuration option for the `ngram` tokenizer; it differentiates between `token` and `document`.
   `token` computes the n-grams within each token, e.g., if your tokens are `['hello', 'this', 'is', 'a', 'nice', 'pr']`, a 3-gram would give you the tokens `['hel', 'ell', 'llo', 'thi', 'his', 'nic', 'ice']`.
   If you use `document` on the other hand, the n-grams are computed over the token sequence of the document, giving you:
   `['"('hello', 'this', 'is')"', '"('this', 'is', 'a')"', '"('is', 'a', 'nice')"', '"('a', 'nice', 'pr')"']`
   
   Introduced `apply_padding` for the tokenizer spec; it specifies whether the output should be padded to `max_tokens`.
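   As a hypothetical illustration of the padding semantics (the actual padding value used by SystemDS may differ):

```java
import java.util.ArrayList;
import java.util.List;

public class PaddingSketch {
    // Truncate or pad a token list to exactly maxTokens entries.
    static List<String> applyPadding(List<String> tokens, int maxTokens) {
        List<String> out = new ArrayList<>(tokens.subList(0, Math.min(tokens.size(), maxTokens)));
        while (out.size() < maxTokens)
            out.add(""); // padding value is illustrative
        return out;
    }

    public static void main(String[] args) {
        System.out.println(applyPadding(List.of("hello", "world"), 4)); // [hello, world, , ]
    }
}
```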

