e-strauss opened a new pull request, #2130:
URL: https://github.com/apache/systemds/pull/2130

   [SYSTEMDS-3782] Bag-of-words Encoder for CP
   
   This patch adds a new feature transformation, bag-of-words (bow), to 
SystemDS' parallel feature transformation framework UPLIFT. Currently, the 
operation is only supported for CP.
   I had to adapt the framework slightly, because bow encoders behave 
differently from other encoders: a bow encoder can create multiple non-zero 
values per row from a single input column. In comparison, encoders like 
Dummycode (dc) create more columns, but each dc encoder always produces exactly 
one non-zero value per row. So, when a bow encoder is involved, the number of 
non-zeros (nnz) in each row is not known upfront, which is problematic for a 
parallel apply with CSR matrix output.
   I added a new field to the ColumnEncoder that holds the nnz for each 
row, which is known once the build is completed.
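
   To illustrate why the per-row nnz matters, here is a minimal Python sketch 
(not the SystemDS Java code). It assumes a hypothetical whitespace tokenizer 
and hypothetical helper names (`build`, `row_pointers`, `apply_row`). Once the 
build has recorded the nnz of every row, a prefix sum yields the CSR row 
pointers, so each row can be written into its own disjoint slice of the output 
arrays without coordinating with other rows.

```python
from collections import Counter
from itertools import accumulate

def tokenize(text):
    # Hypothetical whitespace tokenizer; the tokenization in the patch may differ.
    return text.lower().split()

def build(docs):
    """Build phase: collect the dictionary and the distinct-token count per row."""
    vocab = {}
    nnz_per_row = []
    for doc in docs:
        distinct = set(tokenize(doc))
        for tok in sorted(distinct):
            vocab.setdefault(tok, len(vocab))  # assign the next free column index
        nnz_per_row.append(len(distinct))      # one non-zero per distinct token
    return vocab, nnz_per_row

def row_pointers(nnz_per_row):
    """Prefix sum over the per-row nnz gives the CSR row pointer array."""
    return [0] + list(accumulate(nnz_per_row))

def apply_row(doc, vocab, row_ptr, r, col_idx, values):
    """Apply phase for one row: writes only into [row_ptr[r], row_ptr[r + 1])."""
    counts = Counter(tokenize(doc))
    entries = sorted((vocab[t], c) for t, c in counts.items())
    start = row_ptr[r]
    for k, (j, c) in enumerate(entries):
        col_idx[start + k] = j
        values[start + k] = float(c)

docs = ["a b a", "c", "b c d b"]
vocab, nnz = build(docs)
row_ptr = row_pointers(nnz)
col_idx = [0] * row_ptr[-1]
values = [0.0] * row_ptr[-1]
# The loop is sequential here; because the slices are disjoint, the rows
# could be handed to independent workers instead.
for r, doc in enumerate(docs):
    apply_row(doc, vocab, row_ptr, r, col_idx, values)
```

   Without the per-row nnz, the start offset of a row would depend on the 
encoded output of all preceding rows, which is what blocks a parallel apply.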
   
   Similarly to recode, we estimate the number of distinct tokens in the bow 
dictionary by sampling a subset of rows. In contrast to recode, this estimation 
is computationally more expensive per row, because each sampled row has to go 
through the whole tokenisation process. 
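
   As a rough illustration of sample-based estimation, the following Python 
sketch tokenizes a random subset of rows and scales the number of distinct 
tokens seen linearly to the full input. The function name, the whitespace 
tokenizer, the sample fraction, and the naive linear scale-up are all 
assumptions made for this example; the estimator used in the patch may differ.

```python
import random

def tokenize(text):
    # Hypothetical whitespace tokenizer; the tokenization in the patch may differ.
    return text.lower().split()

def estimate_distinct_tokens(rows, sample_fraction=0.1, seed=42):
    """Estimate the dictionary size from a random sample of rows.

    Uses a naive linear scale-up (an assumption for illustration only),
    which tends to overestimate when tokens repeat across many rows.
    """
    n = len(rows)
    if n == 0:
        return 0
    k = max(1, int(n * sample_fraction))
    rng = random.Random(seed)
    sample = rng.sample(rows, k)
    seen = set()
    for row in sample:
        # Every sampled row is fully tokenized, which is the per-row cost
        # that makes this more expensive than sampling for recode.
        seen.update(tokenize(row))
    return int(len(seen) * n / k)
```

   With `sample_fraction=1.0` every row is tokenized and the result equals the 
exact number of distinct tokens.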


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
