Re: PR-specific Question on Tokenizer regarding Number of non-zerosmismatch

Matthias Boehm Sun, 21 Feb 2021 09:04:38 -0800

thanks for asking here, I would separate a couple of things:

1) There is always a chance of bugs, so let's merge in the singlenodeimplementation and disable the Spark tests for this feature. Then we cantry to reproduce if there are still issues and help fix it.

2) The error is raised whenever we merge two blocks A and B into C, andthe number of non-zeros nnz(C) != nnz(A) + nnz(B) or length(C) < nnz(A)+ nnz(B), because the merge is supposed to merge disjoint cells. Thiserror usually means that either blocks got wrong block ids and thus wereincorrectly merged with other blocks (and thus overwrote cells), or theprevious operations did not maintain the NNZs correctly.


3) Keep both and we'll help figure it out.

Regards,
Matthias

On 2/21/2021 5:26 PM, Markus Reiter-Haas wrote:

Dear SystemDS developers!

I have created a reference implementation for a tokenizer in
https://github.com/apache/systemds/pull/1169 .
There is one consideration I would like to get some input on.

When representing the tokens in long-format (i.e., a transformation
that expands on rows (rows: n, maxTokens: m, idCols: k) -> (m*n,
k+2)), I get the message in a follow-up `transformencode`:
          Job aborted due to stage failure: Task 0 in stage 10.0 failed
1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18,
localhost, executor driver):
org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros
mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz
source=3992)
Unfortunately, I have not been able to fix this bug since it does not
occur in the `tokenize` itself.
However, I have since implemented a wide-format (i.e., a
transformation that expands on columns (rows: n, maxTokens: m, idCols:
k) -> (n, m+k)), where I could not reproduce the issue. The current
state of the PR uses this format in the test cases and passes all
checks.

My specific questions are:
1. Does anyone know what the issue could be or how it could be fixed?
2. Conversely, why does the issue not occur on the wide-format? (I
want to ensure that the code indeed works and not just hides the
error)
3. Should I drop the support for the long-format to circumvent the issue?

Thanks and best regards,
Markus

Re: PR-specific Question on Tokenizer regarding Number of non-zerosmismatch

Reply via email to