I use the rowSetIterator to get the first batch. It should contain one row with the value 1, and the rest of the batch should be nulls. The second batch should contain a single row with a null. The JSON file contains batch_size + 1 rows.
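For reference, here is a minimal, self-contained sketch of that input layout; it mirrors the file-writing loop in the JSON test quoted below. ROWS_PER_BATCH stands in for JSONRecordReader.DEFAULT_ROWS_PER_BATCH, and the 4096 value is an assumption (the thread below refers to BaseValueVector.INITIAL_VALUE_ALLOCATION as "4K"):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of the test input: the first record carries the nested array, the
// remaining ROWS_PER_BATCH records are plain rows, so the projected result
// splits into one full batch (a 1 followed by nulls) and one single-row batch
// (a lone null).
public class TestFileSketch {
  // Assumption: BaseValueVector.INITIAL_VALUE_ALLOCATION is 4096 ("4K" in the thread below).
  static final int ROWS_PER_BATCH = 4096;

  public static void main(String[] args) throws IOException {
    try (FileWriter w = new FileWriter(new File("test.json"))) {
      w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");   // row 1: arrayOfArray[0][0] projects to 1
      for (int i = 0; i < ROWS_PER_BATCH; i++) {
        w.write("{\"anInt\":1}\n");                  // rows 2..ROWS_PER_BATCH + 1: project to null
      }
    }
  }
}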
On Mon, Oct 15, 2018 at 2:32 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Sure looks like a bug. I'd suggest filing a JIRA ticket for this issue,
> attaching your code and input file.
>
> The strange thing is that you suggest that Project works for smaller batch
> sizes but not for larger ones, yet the code of your tests suggests that you
> expect the query to return a single row (or, are there more rows, but
> you're just checking the first?). If just one row, the potential batch size
> should not matter. Can you clarify?
>
> The two short-term options are to either track down the bug in Project, or
> to use your earlier, smaller batch size.
>
> Thanks,
> - Paul
>
>
> On Sunday, October 14, 2018, 6:14:46 PM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> Hey Paul, we think alike ;-) that's exactly what I was doing the past
> couple of days. I was simplifying my test case and trying the same scenario
> with the JSON reader. I was able to reproduce the same issue using the JSON
> reader.
>
> @Test
> public void testArrayOfArrayJson() throws Exception {
>   try (OutputStreamWriter w = new OutputStreamWriter(
>       new FileOutputStream(new File(testDir, "test.json")))) {
>     w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");
>     for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH; i++) {
>       w.write("{\"anInt\":1}\n");
>     }
>   }
>   LogFixtureBuilder logBuilder = LogFixture.builder()
>       // Log to the console for debugging convenience
>       .toConsole().logger("org.apache.drill.exec", Level.TRACE);
>   try (LogFixture logs = logBuilder.build()) {
>     String sql = "select root.arrayOfArray[0][0] as w from `dfs.data`.`test.json` as root";
>     rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
>
>     schemaBuilder = new SchemaBuilder();
>     schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, DataMode.OPTIONAL);
>     expectedSchema = schemaBuilder.buildSchema();
>
>     DirectRowSet batch1 = nextRowSet();
>     rowSetBuilder = newRowSetBuilder();
>     rowSetBuilder.addRow(1L);
>     for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH - 1; i++) {
>       rowSetBuilder.addRow(new Object[] { null });
>     }
>     verify(rowSetBuilder.build(), batch1);
>
>     DirectRowSet batch2 = nextRowSet();
>     rowSetBuilder = newRowSetBuilder();
>     rowSetBuilder.addRow(new Object[] { null });
>     verify(rowSetBuilder.build(), batch2);
>   }
> }
>
> The test passes. Then I change
>
> public static final long DEFAULT_ROWS_PER_BATCH =
>     BaseValueVector.INITIAL_VALUE_ALLOCATION;
>
> to be
>
> public static final long DEFAULT_ROWS_PER_BATCH =
>     BaseValueVector.INITIAL_VALUE_ALLOCATION + 1;
>
> and the test case fails.
>
> I can attach the whole trace output if you like.
>
> On Sat, Oct 13, 2018 at 7:44 PM Paul Rogers <[email protected]> wrote:
>
> > Hi JC,
> >
> > Your test code looks OK. Looks like you've gotten quite good at using the
> > row set classes for testing. If you like, you can simplify your code just
> > a bit by using the rowSet() method of the query builder: it skips the
> > first empty batch for you. Plus, if your result is only a single row with
> > a single long column, you can further simplify by calling singletonLong(),
> > which will grab just that one value.
> >
> > Here is another thought. Perhaps you are hitting an existing bug somewhere
> > in Drill. The Repeated List vector, as noted previously, is under-used and
> > may still contain bugs in some operators. Let's see if we can rule out
> > this case.
> > I suggest this because, when I was testing an updated version of the JSON
> > reader, I did encounter bugs elsewhere in Drill, but I can't recall if the
> > problem was with the LIST or REPEATED LIST type...
> >
> > Try converting your data to JSON, then issue the same query using the JSON
> > reader. This will tell you if the bug is in your code (JSON succeeds) or
> > in Drill itself (JSON fails). You may have to temporarily force the JSON
> > reader to read the same number of records per batch as your reader does.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > On Friday, October 12, 2018, 6:22:40 PM PDT, Paul Rogers <
> > [email protected]> wrote:
> >
> > Drill enforces two hard limits:
> > 1. The maximum number of rows in a batch is 64K.
> > 2. The maximum size of any vector is 4 GB.
> >
> > We have found, however, that fragmentation occurs in our memory allocator
> > for any vector larger than 16 MB. (This is, in fact, the original reason
> > for the result set loader stuff I've been rambling on about.)
> >
> > Your DEFAULT_ROWS_PER_BATCH is now set to 4K * 4 = 16K. This is a fine
> > number of rows (depending completely, of course, on row width).
> >
> > The problem you are having is that you are trying to index a repeated list
> > vector past its end. This very likely means that the code that built the
> > vector has bugs.
> >
> > RepeatedList is tricky: it is an offset vector that wraps a Repeated
> > vector. It is important to get all those offsets just right. Remember
> > that, in offset vectors, the entry for a row is written one position past
> > the row index. (Row 3 needs its offset written in offset vector position 4.)
> >
> > Here I'll gently suggest the use of the RowSet abstractions, which have
> > been tested to ensure that they properly construct each form of vector.
> > Let that code do the dirty work of mucking with the various offsets for
> > you.
> >
> > Alternatively, look at the RowSet (column writers) or ComplexWriter to see
> > if you can figure out what those mechanisms are doing that your code is
> > missing.
> >
> > Here's how I'd debug this. Write a test that exercises your reader in
> > isolation. That is, exercise the reader outside of any query, just by
> > itself. Doing so is a bit tricky given how the scan operator works, but it
> > is possible. Check out the external sort unit tests for some examples;
> > perhaps other developers can point you to others.
> >
> > Configure the reader to read a simple file with just a few rows. Create
> > files that include each type. (It is easier to test if you include a few
> > columns in each of several files, rather than one big file with all column
> > types.) This will give you a record batch with what was read.
> >
> > Then, use the RowSet mechanisms to build up an expected record batch, and
> > compare the expected value with your actual value. This is a much easier
> > mechanism than using the Project operator to catch your vector structure
> > errors.
> >
> > I hope this helps...
> >
> > Thanks,
> > - Paul
> >
> >
> > On Friday, October 12, 2018, 5:31:53 PM PDT, Jean-Claude Cote <
> > [email protected]> wrote:
> >
> > I've changed my record reader's batch size to be larger. All my test cases
> > still work as I would expect, except for one, and I have no idea why. I've
> > turned on tracing in the hope of getting a hint. I now see the failure is
> > in a generated projection class, but I'm not sure why. Can anyone
> > speculate why a change in batch size would cause such a failure?
> >
> > I've added my record reader change, test case and error from the trace.
> > Thanks
> > jc
> >
> > public class MsgpackRecordReader extends AbstractRecordReader {
> >   private static final org.slf4j.Logger logger =
> >       org.slf4j.LoggerFactory.getLogger(MsgpackRecordReader.class);
> >
> >   public static final long DEFAULT_ROWS_PER_BATCH =
> >       BaseValueVector.INITIAL_VALUE_ALLOCATION * 4;
> >
> > @Test
> > public void testSchemaArrayOfArrayCell() throws Exception {
> >   LogFixtureBuilder logBuilder = LogFixture.builder()
> >       // Log to the console for debugging convenience
> >       .toConsole().logger("org.apache.drill.exec", Level.TRACE);
> >   try (LogFixture logs = logBuilder.build()) {
> >     learnModel();
> >     String sql = "select root.arrayOfarray[0][0] as w from dfs.data.`secondBatchHasCompleteModel.mp` as root";
> >     rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
> >
> >     schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, TypeProtos.DataMode.OPTIONAL);
> >     expectedSchema = schemaBuilder.buildSchema();
> >     verifyFirstBatchNull();
> >
> >     rowSetBuilder = newRowSetBuilder();
> >     rowSetBuilder.addRow(1L);
> >     verify(rowSetBuilder.build(), nextRowSet());
> >   }
> > }
> >
> > java.lang.AssertionError: null
> >   at org.apache.drill.exec.vector.complex.RepeatedListVector$DelegateRepeatedVector$RepeatedListAccessor.get(RepeatedListVector.java:73) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.vector.complex.impl.RepeatedListReaderImpl.setPosition(RepeatedListReaderImpl.java:95) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.test.generated.ProjectorGen1.doEval(ProjectorTemplate.java:27) ~[na:na]
> >   at org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords(ProjectorTemplate.java:67) ~[na:na]
> >   at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:232) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
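Since the offset rule Paul describes above is easy to get wrong, here is a minimal, self-contained sketch in plain Java (not Drill's vector API; the class and variable names are made up for illustration) of how the nested value [[1],[1,2]] maps onto the inner and outer offset vectors of a repeated list, following the rule that the entry closing row r lives at position r + 1:

import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of repeated-list offsets: an offset vector always has
// rowCount + 1 entries, and the entry at position r + 1 marks where row r ends.
public class OffsetSketch {
  public static void main(String[] args) {
    // One row whose single column holds the nested value [[1], [1, 2]].
    List<List<Long>> arrayOfArray = Arrays.asList(
        Arrays.asList(1L),
        Arrays.asList(1L, 2L));

    // Inner level: flatten the leaf values and record where each inner list ends.
    int[] innerOffsets = new int[arrayOfArray.size() + 1];
    int valueCount = 0;
    for (int i = 0; i < arrayOfArray.size(); i++) {
      valueCount += arrayOfArray.get(i).size();
      innerOffsets[i + 1] = valueCount;            // list i's closing offset goes at i + 1
    }

    // Outer level: one row, so two entries; entry 1 says the row spans 2 inner lists.
    int[] outerOffsets = { 0, arrayOfArray.size() };

    System.out.println("values:        [1, 1, 2]");
    System.out.println("inner offsets: " + Arrays.toString(innerOffsets));  // [0, 1, 3]
    System.out.println("outer offsets: " + Arrays.toString(outerOffsets));  // [0, 2]
  }
}

A batch builder that omits the trailing entry, or writes a row's closing offset at position r instead of r + 1, can produce the kind of past-the-end index that Paul describes above.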
