e-strauss opened a new pull request, #2130: URL: https://github.com/apache/systemds/pull/2130
[SYSTEMDS-3782] Bag-of-words Encoder for CP

This patch adds a new feature transformation, Bag-of-words (bow), to SystemDS's parallel feature transformation framework UPLIFT. Currently, the operation is only supported for CP.

I had to adapt the framework slightly, because the bow encoder behaves differently from the other encoders: it can create multiple non-zero values from a single input column. In comparison, encoders like Dummycode (dc) also create additional columns, but each dc encoder always produces exactly one non-zero value per row. So, when a bow encoder is involved, the number of non-zeros (nnz) in each row is not known upfront, which is problematic for a parallel apply with CSR matrix output. I added a new field to the ColumnEncoder that holds the nnz for each row, which is known once the build is completed.

Similarly to recode, we estimate the number of distinct tokens in the bow dictionary by sampling a subset of rows. In contrast to recode, this estimation is computationally more expensive per row, because each sampled row has to go through the whole tokenisation process.
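The per-row nnz bookkeeping described in the PR text can be illustrated with a small sketch. This is not the code from the patch (SystemDS is written in Java); it is a minimal Python illustration under stated assumptions: `tokenize`, `build`, and `apply_csr` are hypothetical names, the tokenizer is a plain whitespace split, and the encoded values are token counts. It shows why knowing the nnz of every row after the build phase matters: a prefix sum over those counts gives the CSR row pointers, so each row can be written into its own preallocated slice independently of the others.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def build(rows):
    """Build phase: create the token dictionary and record nnz per row."""
    dictionary = {}
    row_nnz = []
    for text in rows:
        distinct = set(tokenize(text))
        for tok in sorted(distinct):
            # Assign the next free column index to unseen tokens.
            dictionary.setdefault(tok, len(dictionary))
        # One non-zero per distinct token in this row.
        row_nnz.append(len(distinct))
    return dictionary, row_nnz


def apply_csr(rows, dictionary, row_nnz):
    """Apply phase: fill a preallocated CSR structure, one task per row."""
    # Prefix sum of the per-row nnz gives the CSR row pointers.
    indptr = [0] + list(accumulate(row_nnz))
    indices = [0] * indptr[-1]
    values = [0] * indptr[-1]

    def encode_row(i):
        counts = Counter(tokenize(rows[i]))
        entries = sorted((dictionary[t], c) for t, c in counts.items())
        start = indptr[i]
        # Each row writes only to its own slice [indptr[i], indptr[i+1]).
        for k, (col, cnt) in enumerate(entries):
            indices[start + k] = col
            values[start + k] = cnt

    with ThreadPoolExecutor() as pool:
        list(pool.map(encode_row, range(len(rows))))
    return indptr, indices, values


rows = ["the cat sat", "the dog chased the cat", ""]
dictionary, row_nnz = build(rows)
indptr, indices, values = apply_csr(rows, dictionary, row_nnz)
```

Without `row_nnz`, the start offset of a row would depend on the encoded length of all preceding rows, so the rows could not be written in parallel without first tokenizing everything or synchronizing on a shared write position.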