Thanks for the quick response, and also for the tip regarding the wide format.
Regarding keeping only the single-node implementation: should I remove the Spark execution altogether, implement a switch to disable it, or only remove its tests? I am asking because parts of the tokenizer code are only relevant for the Spark execution (e.g., estimating the dimensions and using a maximum number of tokens) and influenced some design decisions (e.g., not creating a lookup dictionary).

Best regards,
Markus

On Sun, Feb 21, 2021 at 6:04 PM Matthias Boehm <[email protected]> wrote:
>
> thanks for asking here, I would separate a couple of things:
>
> 1) There is always a chance of bugs, so let's merge in the singlenode
> implementation and disable the Spark tests for this feature. Then we can
> try to reproduce if there are still issues and help fix it.
>
> 2) The error is raised whenever we merge two blocks A and B into C, and
> the number of non-zeros nnz(C) != nnz(A) + nnz(B) or length(C) < nnz(A)
> + nnz(B), because the merge is supposed to merge disjoint cells. This
> error usually means that either blocks got wrong block ids and thus were
> incorrectly merged with other blocks (and thus overwrote cells), or the
> previous operations did not maintain the NNZs correctly.
>
> 3) Keep both and we'll help figure it out.
>
> Regards,
> Matthias
>
> On 2/21/2021 5:26 PM, Markus Reiter-Haas wrote:
> > Dear SystemDS developers!
> >
> > I have created a reference implementation for a tokenizer in
> > https://github.com/apache/systemds/pull/1169 .
> > There is one consideration I would like to get some input on.
> >
> > When representing the tokens in long format (i.e., a transformation
> > that expands on rows: (rows: n, maxTokens: m, idCols: k) -> (m*n, k+2)),
> > I get the following message in a follow-up `transformencode`:
> >
> > Job aborted due to stage failure: Task 0 in stage 10.0 failed
> > 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18,
> > localhost, executor driver):
> > org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros
> > mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz
> > source=3992)
> >
> > Unfortunately, I have not been able to fix this bug, since it does not
> > occur in `tokenize` itself.
> > However, I have since implemented a wide format (i.e., a transformation
> > that expands on columns: (rows: n, maxTokens: m, idCols: k) -> (n, m+k)),
> > where I could not reproduce the issue. The current state of the PR uses
> > this format in the test cases and passes all checks.
> >
> > My specific questions are:
> > 1. Does anyone know what the issue could be or how it could be fixed?
> > 2. Conversely, why does the issue not occur with the wide format? (I
> > want to ensure that the code indeed works and does not just hide the
> > error.)
> > 3. Should I drop support for the long format to circumvent the issue?
> >
> > Thanks and best regards,
> > Markus
> >
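For readers following along: the disjoint-merge invariant Matthias describes in point 2 can be sketched in a few lines. This is not SystemDS code; it is a toy Python/numpy stand-in (dense arrays instead of matrix blocks, a hypothetical `merge_disjoint` helper) that only illustrates why overlapping cells trip the nnz check.

```python
import numpy as np

def merge_disjoint(a, b):
    # Merge two blocks whose non-zero cells are assumed disjoint,
    # enforcing the invariant nnz(C) == nnz(A) + nnz(B).
    nnz_a, nnz_b = np.count_nonzero(a), np.count_nonzero(b)
    c = a + b  # for disjoint cells, addition is equivalent to copying
    if np.count_nonzero(c) != nnz_a + nnz_b:
        raise ValueError(
            "Number of non-zeros mismatch on merge disjoint "
            f"(nnz target={np.count_nonzero(c)}, nnz source={nnz_a + nnz_b})")
    return c

# Disjoint blocks merge cleanly:
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 2.0], [0.0, 3.0]])
c = merge_disjoint(a, b)  # nnz(c) == 1 + 2

# Overlapping cells (e.g. two blocks assigned the same block id)
# collapse onto one another, so nnz(C) < nnz(A) + nnz(B):
bad = np.array([[4.0, 0.0], [0.0, 3.0]])  # overlaps `a` at cell (0, 0)
try:
    merge_disjoint(a, bad)
except ValueError as e:
    print(e)
```

The check fires on either failure mode Matthias lists: cells overwritten because of wrong block ids, or nnz metadata that an earlier operation failed to maintain.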

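To make the long vs. wide output shapes from the original question concrete, here is a toy sketch in plain Python (hypothetical helper names, whitespace tokenization; not the PR's actual code). With n documents, up to m tokens each, and k id columns, the long format yields up to m*n rows of k+2 columns (ids, position, token), while the wide format yields n rows of k+m columns.

```python
def tokenize_long(docs, ids, max_tokens):
    # Long format: one output row per token -> up to (m*n, k+2);
    # fewer rows when a document has fewer than max_tokens tokens.
    out = []
    for row_ids, doc in zip(ids, docs):
        tokens = doc.split()[:max_tokens]
        for pos, tok in enumerate(tokens):
            out.append(list(row_ids) + [pos, tok])
    return out

def tokenize_wide(docs, ids, max_tokens):
    # Wide format: one output row per document -> (n, k+m),
    # padding short documents so every row has the same width.
    out = []
    for row_ids, doc in zip(ids, docs):
        tokens = doc.split()[:max_tokens]
        tokens += [""] * (max_tokens - len(tokens))
        out.append(list(row_ids) + tokens)
    return out

docs, ids = ["a b c", "d e"], [[1], [2]]
print(tokenize_long(docs, ids, 3))   # 5 rows of 3 columns (k=1)
print(tokenize_wide(docs, ids, 3))   # 2 rows of 4 columns
```

Note that the long format only reaches the full m*n rows when every document has at least max_tokens tokens; the wide format always has a fixed shape.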