Dear SystemDS developers!

I have created a reference implementation for a tokenizer in
https://github.com/apache/systemds/pull/1169 .
There is one consideration I would like to get some input on.

When representing the tokens in long format (i.e., a transformation
that expands on rows: (rows: n, maxTokens: m, idCols: k) -> (m*n,
k+2)), I get the following error in a follow-up `transformencode`:

    Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times,
    most recent failure: Lost task 0.0 in stage 10.0 (TID 18, localhost,
    executor driver): org.apache.sysds.runtime.DMLRuntimeException:
    Number of non-zeros mismatch on merge disjoint (target=1000x4,
    nnz target=4000, nnz source=3992)

Unfortunately, I have not been able to fix this bug, since it does not
occur in `tokenize` itself but only in the subsequent `transformencode`.
However, I have since implemented a wide format (i.e., a
transformation that expands on columns: (rows: n, maxTokens: m, idCols:
k) -> (n, m+k)), with which I could not reproduce the issue. The current
state of the PR uses this format in the test cases and passes all
checks.
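
To make the two layouts concrete, here is a minimal Python sketch of the
shapes described above. It is not the code from the PR: the function names,
the whitespace splitting, and the padding of short rows with an empty
string are assumptions made purely for illustration.

```python
def pad_tokens(text, max_tokens, pad=""):
    """Split on whitespace, truncate to max_tokens, pad short rows (assumed)."""
    tokens = text.split()[:max_tokens]
    return tokens + [pad] * (max_tokens - len(tokens))


def to_long(rows, max_tokens):
    """Long format: one output row per (input row, token position).

    Input:  n rows of (id_cols, text) with k id columns.
    Output: m*n rows with k+2 columns (ids, position, token).
    """
    return [
        (*ids, pos, tok)
        for ids, text in rows
        for pos, tok in enumerate(pad_tokens(text, max_tokens), start=1)
    ]


def to_wide(rows, max_tokens):
    """Wide format: one output row per input row, one column per token.

    Output: n rows with m+k columns (ids followed by m token columns).
    """
    return [(*ids, *pad_tokens(text, max_tokens)) for ids, text in rows]


if __name__ == "__main__":
    rows = [((1,), "a b c"), ((2,), "d e")]  # n=2, k=1
    long_fmt = to_long(rows, max_tokens=3)   # m=3
    wide_fmt = to_wide(rows, max_tokens=3)
    print(len(long_fmt), len(long_fmt[0]))   # rows, columns of long format
    print(len(wide_fmt), len(wide_fmt[0]))   # rows, columns of wide format
```

With n=2, m=3, and k=1, the long format has m*n = 6 rows and k+2 = 3
columns, while the wide format has n = 2 rows and m+k = 4 columns.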

My specific questions are:
1. Does anyone know what the issue could be or how it could be fixed?
2. Conversely, why does the issue not occur with the wide format? (I
want to ensure that the code indeed works and does not just hide the
error.)
3. Should I drop support for the long format to circumvent the issue?

Thanks and best regards,
Markus

-- 
Dipl.-Ing. Markus Reiter-Haas, BSc

University Assistant, PhD Student
Social Computing Lab
Institute of Interactive Systems and Data Science, TU Graz
