Hi JC,

Sure looks like a bug. I'd suggest filing a JIRA ticket for this issue, 
attaching your code and input file.

The strange thing is that you say Project works for smaller batch sizes but 
not for larger ones, yet your test code suggests that you expect the query to 
return a single row (or are there more rows, and you're just checking the 
first?). If it is just one row, the batch size should not matter. Can you 
clarify?

The two short-term options are to either track down the bug in Project, or to 
use your earlier, smaller batch size.

Thanks,
- Paul

 

    On Sunday, October 14, 2018, 6:14:46 PM PDT, Jean-Claude Cote 
<jcc...@gmail.com> wrote:  
 
 Hey Paul, we think alike ;-) that's exactly what I was doing the past
couple of days. I was simplifying my test case and trying the same scenario
with the JSON reader. I was able to reproduce the same issue using the JSON
reader.

  @Test
  public void testArrayOfArrayJson() throws Exception {
    try (OutputStreamWriter w = new OutputStreamWriter(
        new FileOutputStream(new File(testDir, "test.json")))) {
      w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");
      for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH; i++) {
        w.write("{\"anInt\":1}\n");
      }
    }
    LogFixtureBuilder logBuilder = LogFixture.builder()
        // Log to the console for debugging convenience
        .toConsole().logger("org.apache.drill.exec", Level.TRACE);
    try (LogFixture logs = logBuilder.build()) {
      String sql = "select root.arrayOfArray[0][0] as w from `dfs.data`.`test.json` as root";
      rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();

      schemaBuilder = new SchemaBuilder();
      schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, DataMode.OPTIONAL);
      expectedSchema = schemaBuilder.buildSchema();

      DirectRowSet batch1 = nextRowSet();
      rowSetBuilder = newRowSetBuilder();
      rowSetBuilder.addRow(1L);
      for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH - 1; i++) {
        rowSetBuilder.addRow(new Object[] { null });
      }
      verify(rowSetBuilder.build(), batch1);

      DirectRowSet batch2 = nextRowSet();
      rowSetBuilder = newRowSetBuilder();
      rowSetBuilder.addRow(new Object[] { null });
      verify(rowSetBuilder.build(), batch2);
    }
  }

The test passes. Then I change

  public static final long DEFAULT_ROWS_PER_BATCH =
      BaseValueVector.INITIAL_VALUE_ALLOCATION;

to be

  public static final long DEFAULT_ROWS_PER_BATCH =
      BaseValueVector.INITIAL_VALUE_ALLOCATION + 1;

and the test case fails.

I can attach the whole trace output if you like.

On Sat, Oct 13, 2018 at 7:44 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi JC,
>
> Your test code looks OK. Looks like you've gotten quite good at using the
> row set classes for testing. If you like, you can simplify your code just a
> bit by using the rowSet() method of the query builder: it skips the first
> empty batch for you. Plus, if your result is only a single row with a
> single long column, you can further simplify by calling singletonLong()
> which will grab just that one value.
>
> Here is another thought. Perhaps you are hitting an existing bug somewhere
> in Drill. The Repeated List vector, as noted previously, is under-used and
> may still contain bugs in some operators. Let's see if we can rule out this
> case. I suggest this because, when I was testing an updated version of the
> JSON reader, I did encounter bugs elsewhere in Drill, but I can't recall if
> the problem was with the LIST or REPEATED LIST type...
>
> Try converting your data to JSON, then issue the same query using the JSON
> reader. This will tell you if the bug is in your code (JSON succeeds) or in
> Drill itself (JSON fails). You may have to temporarily force the JSON
> reader to read the same number of records per batch as your reader does.
>
> Thanks,
>
> - Paul
>
>
>
>    On Friday, October 12, 2018, 6:22:40 PM PDT, Paul Rogers <
> par0...@yahoo.com> wrote:
>
>  Drill enforces two hard limits:
> 1. The maximum number of rows in a batch is 64K.
> 2. The maximum size of any vector is 4 GB.
>
> We have found, however, that fragmentation occurs in our memory allocator
> for any vector larger than 16 MB. (This is, in fact, the original reason for
> the result set loader stuff I've been rambling on about.)
>
> Your DEFAULT_ROWS_PER_BATCH is now set to 4K * 4 = 16K. This is a fine
> number of rows (completely depending, of course, on row width.)
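The arithmetic behind those figures can be sanity-checked with a small sketch. This is my own back-of-the-envelope calculation, assuming INITIAL_VALUE_ALLOCATION is 4096, which matches the 4K * 4 = 16K figure above; the class and variable names are invented for illustration:

```java
// Back-of-the-envelope batch-size math for the limits discussed above:
// 64K rows max per batch, 16 MB per vector before allocator fragmentation.
public class BatchSizeMath {
    public static void main(String[] args) {
        int rows = 4096 * 4;              // DEFAULT_ROWS_PER_BATCH = 16K rows

        // A fixed-width BIGINT column needs 8 bytes per value.
        long bigintVectorBytes = rows * 8L;
        System.out.println(bigintVectorBytes);   // 131072 bytes = 128 KB, far under 16 MB

        // Widest fixed-width value that still fits 64K rows under 16 MB:
        long maxValueWidth = (16L * 1024 * 1024) / (64 * 1024);
        System.out.println(maxValueWidth);       // 256 bytes per value
    }
}
```

So at 16K rows, even wide fixed-width columns stay comfortably below the 16 MB fragmentation threshold; variable-width columns are the ones to watch.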
>
> The problem you are having is that you are trying to index a repeated list
> vector past its end. This very likely means that your code that built the
> vector has bugs.
>
> RepeatedList is tricky: it is an offset vector that wraps a Repeated
> vector. It is important to get all those offsets just right. Remember that,
> in offset vectors, the entry position is one greater than the row index.
> (Row 3's end offset is written in offset vector position 4.)
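That offset rule can be sketched with plain Java arrays. This is an illustration only; the offsets helper below is invented for this example and is not Drill's actual vector API:

```java
// Sketch of an offset vector for a repeated (array) column, using plain
// arrays instead of Drill's UInt4Vector. Entry 0 is always 0, and the END
// of row i is written at position i + 1 -- the "one greater" rule.
public class OffsetVectorSketch {
    // Build the offset vector for the given rows of values.
    static int[] offsets(int[][] rows) {
        int[] off = new int[rows.length + 1];
        int pos = 0;
        for (int i = 0; i < rows.length; i++) {
            pos += rows[i].length;
            off[i + 1] = pos;   // row i ends here, written at position i + 1
        }
        return off;
    }

    public static void main(String[] args) {
        // Three rows of a repeated column: [10, 20], [30], and an empty array.
        int[] off = offsets(new int[][] { {10, 20}, {30}, {} });
        // Row i's values span positions [off[i], off[i+1]) in the data vector;
        // an empty row just repeats the previous offset.
        System.out.println(java.util.Arrays.toString(off));  // prints [0, 2, 3, 3]
    }
}
```

A missing or off-by-one entry here is exactly the kind of error that makes a downstream reader index past the end of the vector.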
>
> Here I'll gently suggest the use of the RowSet abstractions which have
> been tested to ensure that they do properly construct each form of vector.
> Let that code do the dirty work for you of mucking with the various offsets.
>
> Alternatively, look at the RowSet (column writers) or ComplexWriter to
> see if you can figure out what those mechanisms are doing that your code is
> missing.
>
> Here's how I'd debug this. Write a test that exercises your reader in
> isolation. That is, exercise the reader outside of any query, just by
> itself. Doing so is a bit tricky given how the scan operator works, but is
> possible. Check out the external sort unit tests for some examples; perhaps
> other developers can point out additional ones.
>
> Configure the reader to read a simple file with just a few rows. Create
> files that include each type. (Easier to test if you include a few columns
> in each of several files, rather than one big file with all column types.)
> This will give you a record batch with what was read.
>
> Then, use the RowSet mechanisms to build up an expected record batch, then
> compare the expected value with your actual value. This is a much easier
> mechanism than using the Project operator to catch your vector structure
> errors.
>
> I hope this helps...
>
> Thanks,
> - Paul
>
>
>
>    On Friday, October 12, 2018, 5:31:53 PM PDT, Jean-Claude Cote <
> jcc...@gmail.com> wrote:
>
>  I've changed my record reader's batch size to be larger. All my test cases
> still work as I would expect, except for one, and I have no idea why. I've
> turned on tracing in the hopes of getting a hint. I now see it is in a
> generated projection class, but I'm not sure why. Can anyone speculate why
> a change in batch size would cause such a failure?
>
> I've added my record reader change, test case and error from the trace.
> Thanks
> jc
>
> public class MsgpackRecordReader extends AbstractRecordReader {
>   private static final org.slf4j.Logger logger =
>       org.slf4j.LoggerFactory.getLogger(MsgpackRecordReader.class);
>
>   public static final long DEFAULT_ROWS_PER_BATCH =
>       BaseValueVector.INITIAL_VALUE_ALLOCATION * 4;
>
>   @Test
>   public void testSchemaArrayOfArrayCell() throws Exception {
>     LogFixtureBuilder logBuilder = LogFixture.builder()
>         // Log to the console for debugging convenience
>         .toConsole().logger("org.apache.drill.exec", Level.TRACE);
>     try (LogFixture logs = logBuilder.build()) {
>       learnModel();
>       String sql = "select root.arrayOfarray[0][0] as w from dfs.data.`secondBatchHasCompleteModel.mp` as root";
>       rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
>
>       schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, TypeProtos.DataMode.OPTIONAL);
>       expectedSchema = schemaBuilder.buildSchema();
>       verifyFirstBatchNull();
>
>       rowSetBuilder = newRowSetBuilder();
>       rowSetBuilder.addRow(1L);
>       verify(rowSetBuilder.build(), nextRowSet());
>     }
>   }
>
> java.lang.AssertionError: null
>     at org.apache.drill.exec.vector.complex.RepeatedListVector$DelegateRepeatedVector$RepeatedListAccessor.get(RepeatedListVector.java:73) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.vector.complex.impl.RepeatedListReaderImpl.setPosition(RepeatedListReaderImpl.java:95) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.test.generated.ProjectorGen1.doEval(ProjectorTemplate.java:27) ~[na:na]
>     at org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords(ProjectorTemplate.java:67) ~[na:na]
>     at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:232) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>
  
