Drill enforces two hard limits:
1. The maximum number of rows in a batch is 64K.
2. The maximum size of any vector is 4 GB.
We have found, however, that fragmentation occurs in our memory allocator for
any vector larger than 16 MB. (This is, in fact, the original reason for the
result set loader work I've been rambling on about.)
Your DEFAULT_ROWS_PER_BATCH is now set to 4K * 4 = 16K. That is a fine number
of rows (depending, of course, entirely on row width).
The problem you are having is that you are trying to index a repeated list
vector past its end. This very likely means that the code that built the
vector has a bug.
RepeatedList is tricky: it is an offset vector that wraps a repeated vector. It
is important to get all those offsets just right. Remember that, in an offset
vector, the entry for a row is written one position past the row index. (Row 3's
end offset is written at offset vector position 4.)
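To make that concrete, here is a tiny sketch using plain Java arrays
(deliberately not the real Drill vector classes) for a repeated-list column
holding the two rows [[1, 2], [3]] and [[4]]:

    // Outer offset vector: one more entry than there are rows.
    // outerOffsets[row + 1] marks the end of that row's inner lists.
    int[] outerOffsets = { 0, 2, 3 };     // row 0 has 2 inner lists, row 1 has 1

    // Inner offset vector: one more entry than there are inner lists.
    int[] innerOffsets = { 0, 2, 3, 4 };  // inner list lengths 2, 1, 1
    long[] values      = { 1, 2, 3, 4 };

    // Reading row 1, inner list 0, means walking both offset levels.
    int row = 1;
    int list  = outerOffsets[row];        // index of the row's first inner list
    int start = innerOffsets[list];       // first value of that list
    int end   = innerOffsets[list + 1];   // one past its last value
    for (int i = start; i < end; i++) {
      System.out.println(values[i]);      // prints 4
    }

If the writer forgets to bump one of those trailing offsets, a downstream
reader will happily walk off the end of the vector, which is the kind of
assertion you are seeing below.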
Here I'll gently suggest the use of the RowSet abstractions, which have been
tested to ensure that they properly construct each form of vector. Let that
code do the dirty work of mucking with the various offsets for you.
Alternatively, look at the RowSet (column writers) or ComplexWriter code to see
if you can figure out what those mechanisms are doing that your code is missing.
Here's how I'd debug this. Write a test that exercises your reader in
isolation. That is, exercise the reader outside of any query, just by itself.
Doing so is a bit tricky given how the scan operator works, but it is possible.
Check out the external sort unit tests for some examples; perhaps other
developers can point you to others.
Configure the reader to read a simple file with just a few rows. Create files
that include each type. (Easier to test if you include a few columns in each of
several files, rather than one big file with all column types.) This will give
you a record batch with what was read.
Then, use the RowSet mechanisms to build up an expected record batch, and
compare the expected values with your actual values. This is a much easier
mechanism than using the Project operator to catch your vector structure errors.
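For what it's worth, the comparison idiom looks roughly like this. (A sketch
only: the exact packages, the allocator accessor, and the rowSet() helper are
assumptions on my part and shift a bit between Drill versions, so adjust to
your setup.)

    // Build the batch we expect the reader (plus projection) to produce.
    TupleMetadata expectedSchema = new SchemaBuilder()
        .addNullable("w", TypeProtos.MinorType.BIGINT)
        .buildSchema();

    RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
        .addRow(1L)
        .build();

    // Run the query and capture the actual batch as a row set.
    RowSet actual = client.queryBuilder().sql(sql).rowSet();

    // Compares schema and values, then releases both row sets.
    new RowSetComparison(expected).verifyAndClearAll(actual);

One call catches schema and offset mistakes directly, instead of leaving the
Project operator's generated code to trip over them later.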
I hope this helps...
Thanks,
- Paul
On Friday, October 12, 2018, 5:31:53 PM PDT, Jean-Claude Cote
<[email protected]> wrote:
I've changed my record reader's batch size to be larger. All my test cases
still work as I would expect, except for one, and I have no idea why. I've
turned on tracing in the hopes of getting a hint. I now see the failure is in a
generated projection class, but I'm not sure why. Can anyone speculate why
a change in batch size would cause such a failure?
I've added my record reader change, test case and error from the trace.
Thanks
jc
public class MsgpackRecordReader extends AbstractRecordReader {
  private static final org.slf4j.Logger logger =
      org.slf4j.LoggerFactory.getLogger(MsgpackRecordReader.class);

  public static final long DEFAULT_ROWS_PER_BATCH =
      BaseValueVector.INITIAL_VALUE_ALLOCATION * 4;
@Test
public void testSchemaArrayOfArrayCell() throws Exception {
  LogFixtureBuilder logBuilder = LogFixture.builder()
      // Log to the console for debugging convenience
      .toConsole()
      .logger("org.apache.drill.exec", Level.TRACE);
  try (LogFixture logs = logBuilder.build()) {
    learnModel();
    String sql = "select root.arrayOfarray[0][0] as w "
        + "from dfs.data.`secondBatchHasCompleteModel.mp` as root";
    rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
    schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, TypeProtos.DataMode.OPTIONAL);
    expectedSchema = schemaBuilder.buildSchema();
    verifyFirstBatchNull();
    rowSetBuilder = newRowSetBuilder();
    rowSetBuilder.addRow(1L);
    verify(rowSetBuilder.build(), nextRowSet());
  }
}
java.lang.AssertionError: null
    at org.apache.drill.exec.vector.complex.RepeatedListVector$DelegateRepeatedVector$RepeatedListAccessor.get(RepeatedListVector.java:73) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.impl.RepeatedListReaderImpl.setPosition(RepeatedListReaderImpl.java:95) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
    at org.apache.drill.exec.test.generated.ProjectorGen1.doEval(ProjectorTemplate.java:27) ~[na:na]
    at org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords(ProjectorTemplate.java:67) ~[na:na]
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:232) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]