Thanks for the quick response, and also for the tip regarding the wide format.
Regarding keeping only the single-node implementation: should I remove the Spark execution altogether, implement a switch to disable it, or only remove its tests? I am asking because parts of the tokenizer code are only relevant for the Spark execution (e.g., estimating the dimensions and using a maximum number of tokens) and influenced some design decisions (e.g., not creating a lookup dictionary).

Best regards,
Markus

On Sun, Feb 21, 2021 at 6:04 PM Matthias Boehm <[email protected]> wrote:
>
> thanks for asking here, I would separate a couple of things:
>
> 1) There is always a chance of bugs, so let's merge in the singlenode
> implementation and disable the Spark tests for this feature. Then we can
> try to reproduce if there are still issues and help fix it.
>
> 2) The error is raised whenever we merge two blocks A and B into C, and
> the number of non-zeros nnz(C) != nnz(A) + nnz(B) or length(C) < nnz(A)
> + nnz(B), because the merge is supposed to merge disjoint cells. This
> error usually means that either blocks got wrong block ids and thus were
> incorrectly merged with other blocks (and thus overwrote cells), or the
> previous operations did not maintain the NNZs correctly.
>
> 3) Keep both and we'll help figure it out.
>
> Regards,
> Matthias
>
> On 2/21/2021 5:26 PM, Markus Reiter-Haas wrote:
> > Dear SystemDS developers!
> >
> > I have created a reference implementation for a tokenizer in
> > https://github.com/apache/systemds/pull/1169 .
> > There is one consideration I would like to get some input on.
> >
> > When representing the tokens in long format (i.e., a transformation
> > that expands on rows: (rows: n, maxTokens: m, idCols: k) -> (m*n, k+2)),
> > I get the following message in a follow-up `transformencode`:
> >
> > Job aborted due to stage failure: Task 0 in stage 10.0 failed
> > 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 18,
> > localhost, executor driver):
> > org.apache.sysds.runtime.DMLRuntimeException: Number of non-zeros
> > mismatch on merge disjoint (target=1000x4, nnz target=4000, nnz
> > source=3992)
> >
> > Unfortunately, I have not been able to fix this bug, since it does not
> > occur in `tokenize` itself.
> > However, I have since implemented a wide format (i.e., a transformation
> > that expands on columns: (rows: n, maxTokens: m, idCols: k) -> (n, m+k)),
> > where I could not reproduce the issue. The current state of the PR uses
> > this format in the test cases and passes all checks.
> >
> > My specific questions are:
> > 1. Does anyone know what the issue could be or how it could be fixed?
> > 2. Conversely, why does the issue not occur with the wide format? (I
> > want to ensure that the code indeed works and does not just hide the
> > error.)
> > 3. Should I drop support for the long format to circumvent the issue?
> >
> > Thanks and best regards,
> > Markus
> >
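For readers following along: the disjoint-merge invariant Matthias describes in point 2 can be sketched in a few lines. This is not SystemDS code; it is a toy Python/numpy stand-in (dense arrays instead of matrix blocks, a hypothetical `merge_disjoint` helper) that only illustrates why overlapping cells trip the nnz check.

```python
import numpy as np

def merge_disjoint(a, b):
    # Merge two blocks whose non-zero cells are assumed disjoint,
    # enforcing the invariant nnz(C) == nnz(A) + nnz(B).
    nnz_a, nnz_b = np.count_nonzero(a), np.count_nonzero(b)
    c = a + b  # for disjoint cells, addition is equivalent to copying
    if np.count_nonzero(c) != nnz_a + nnz_b:
        raise ValueError(
            "Number of non-zeros mismatch on merge disjoint "
            f"(nnz target={np.count_nonzero(c)}, nnz source={nnz_a + nnz_b})")
    return c

# Disjoint blocks merge cleanly:
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 2.0], [0.0, 3.0]])
c = merge_disjoint(a, b)  # nnz(c) == 1 + 2

# Overlapping cells (e.g. two blocks assigned the same block id)
# collapse onto one another, so nnz(C) < nnz(A) + nnz(B):
bad = np.array([[4.0, 0.0], [0.0, 3.0]])  # overlaps `a` at cell (0, 0)
try:
    merge_disjoint(a, bad)
except ValueError as e:
    print(e)
```

The check fires on either failure mode Matthias lists: cells overwritten because of wrong block ids, or nnz metadata that an earlier operation failed to maintain.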

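To make the long vs. wide output shapes from the original question concrete, here is a toy sketch in plain Python (hypothetical helper names, whitespace tokenization; not the PR's actual code). With n documents, up to m tokens each, and k id columns, the long format yields up to m*n rows of k+2 columns (ids, position, token), while the wide format yields n rows of k+m columns.

```python
def tokenize_long(docs, ids, max_tokens):
    # Long format: one output row per token -> up to (m*n, k+2);
    # fewer rows when a document has fewer than max_tokens tokens.
    out = []
    for row_ids, doc in zip(ids, docs):
        tokens = doc.split()[:max_tokens]
        for pos, tok in enumerate(tokens):
            out.append(list(row_ids) + [pos, tok])
    return out

def tokenize_wide(docs, ids, max_tokens):
    # Wide format: one output row per document -> (n, k+m),
    # padding short documents so every row has the same width.
    out = []
    for row_ids, doc in zip(ids, docs):
        tokens = doc.split()[:max_tokens]
        tokens += [""] * (max_tokens - len(tokens))
        out.append(list(row_ids) + tokens)
    return out

docs, ids = ["a b c", "d e"], [[1], [2]]
print(tokenize_long(docs, ids, 3))   # 5 rows of 3 columns (k=1)
print(tokenize_wide(docs, ids, 3))   # 2 rows of 4 columns
```

Note that the long format only reaches the full m*n rows when every document has at least max_tokens tokens; the wide format always has a fixed shape.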