ilovemesomeramen opened a new pull request, #1639: URL: https://github.com/apache/systemds/pull/1639
This PR extends the existing tokenisation to work with multi-threading, and additionally adds some minor features and bug fixes.

MINOR bug fix: Removed hardcoded thread values in `MultiColumnEncoder`.

Tokenisation was moved into the `build`/`apply` paradigm that was introduced for `transformencode`, and the multi-threading implementation uses the same structure. In the `build` stage the input is split into tokens and saved in an internal representation; additional metadata needed in the `apply` phase is also computed. During `apply` the computed data is retrieved and written to the output. The current implementation splits the input frame into row partitions; the default is 64 partitions, which can be changed with the `sysds.parallel.tokenize.numBlocks` configuration. Multi-threading is disabled by default at the moment and can be activated via the `sysds.parallel.tokenize` config (see the schematic and config sketch below).

This is a first implementation with some known issues:
1. Memory consumption: the execution DAG is not yet well optimized for memory consumption. This could be fixed in the future by computing subsets one after another (only possible when padding is enabled).
2. Cache performance: a similar issue as (1); computing subsets first in a cache-aware manner (unrolling loops) could increase performance.

There is still quite some redundant code in the `TokenizerApplier`s, although this is not easy to clean up without a major refactor of the previous implementation.

New features:
- Introduced `ngram_type` as a configuration for the `ngram` tokenizer. It differentiates between `token` and `document`. With `token`, the ngrams are created over each token: e.g., if your tokens are `['hello', 'this', 'is', 'a', 'nice', 'pr']`, a 3-gram gives you the tokens `['hel', 'ell', 'llo', 'thi', 'his', 'nic', 'ice']`. With `document`, on the other hand, the ngrams are computed over the tokens in the document, giving you `['"('hello', 'this', 'is')"', '"('this', 'is', 'a')"', '"('is', 'a', 'nice')"', '"('a', 'nice', 'pr')"']`. A small sketch illustrating the difference follows below.
- Introduced `apply_padding` for the tokenizer spec; it specifies whether the output should be padded to `max_tokens` (see the hypothetical spec snippet at the end).
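For orientation, here is a minimal schematic of the build/apply structure described above. All names here are my own and the exact scheduling is simplified; the actual PR splits the frame into `sysds.parallel.tokenize.numBlocks` row partitions and runs the phases through SystemDS's encoder infrastructure:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Schematic only: the real classes (e.g. TokenizerApplier) differ, and the
// actual implementation drives both phases via the execution DAG.
class ParallelTokenizeSketch {
    static <B, T> void tokenize(List<B> rowPartitions, Function<B, T> build,
            Consumer<T> apply, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // build phase: tokenize each row partition and compute metadata in parallel
            List<Future<T>> built = new ArrayList<>();
            for (B part : rowPartitions)
                built.add(pool.submit(() -> build.apply(part)));
            // apply phase: retrieve the precomputed tokens and write them to the output
            for (Future<T> f : built)
                apply.accept(f.get());
        }
        finally {
            pool.shutdown();
        }
    }
}
```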
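The two new properties would be set like other `sysds.*` options; a minimal config sketch, assuming the standard `SystemDS-config.xml` layout (only the two property names and the default of 64 come from this PR):

```xml
<root>
    <!-- enable multi-threaded tokenisation (off by default) -->
    <sysds.parallel.tokenize>true</sysds.parallel.tokenize>
    <!-- number of row partitions; 64 is the default -->
    <sysds.parallel.tokenize.numBlocks>64</sysds.parallel.tokenize.numBlocks>
</root>
```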
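To make the `token` vs. `document` distinction concrete, a small self-contained Java sketch that reproduces the example above; class and method names are illustrative, not the PR's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class NgramTypeExample {

    // ngram_type = "token": character n-grams within each individual token
    static List<String> tokenNgrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (String t : tokens)
            for (int i = 0; i + n <= t.length(); i++)
                out.add(t.substring(i, i + n));
        return out;
    }

    // ngram_type = "document": n-grams over the token sequence of the document
    static List<String> documentNgrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++)
            out.add("('" + String.join("', '", tokens.subList(i, i + n)) + "')");
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("hello", "this", "is", "a", "nice", "pr");
        System.out.println(tokenNgrams(tokens, 3));
        // -> [hel, ell, llo, thi, his, nic, ice]  (tokens shorter than n are skipped)
        System.out.println(documentNgrams(tokens, 3));
        // -> [('hello', 'this', 'is'), ('this', 'is', 'a'),
        //     ('is', 'a', 'nice'), ('a', 'nice', 'pr')]
    }
}
```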
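Finally, a hypothetical tokenizer spec combining the new options. Only `ngram_type`, `apply_padding`, and `max_tokens` are discussed in this PR; the surrounding keys (`algo`, `algo_params`, `out`) reflect the pre-existing tokenize spec format as I understand it and may not match the exact field names:

```json
{
  "algo": "ngram",
  "algo_params": {"min_gram": 3, "max_gram": 3, "ngram_type": "document"},
  "out": "count",
  "max_tokens": 100,
  "apply_padding": true
}
```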