Hey Paul, we think alike ;-) That's exactly what I was doing for the past
couple of days: simplifying my test case and trying the same scenario with
the JSON reader. I was able to reproduce the same issue with it.
@Test
public void testArrayOfArrayJson() throws Exception {
  try (OutputStreamWriter w = new OutputStreamWriter(
      new FileOutputStream(new File(testDir, "test.json")))) {
    w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");
    for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH; i++) {
      w.write("{\"anInt\":1}\n");
    }
  }
  LogFixtureBuilder logBuilder = LogFixture.builder()
      // Log to the console for debugging convenience
      .toConsole()
      .logger("org.apache.drill.exec", Level.TRACE);
  try (LogFixture logs = logBuilder.build()) {
    String sql = "select root.arrayOfArray[0][0] as w from `dfs.data`.`test.json` as root";
    rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();

    schemaBuilder = new SchemaBuilder();
    schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, DataMode.OPTIONAL);
    expectedSchema = schemaBuilder.buildSchema();

    // First batch: the value from the first row, then nulls for the rest of the batch.
    DirectRowSet batch1 = nextRowSet();
    rowSetBuilder = newRowSetBuilder();
    rowSetBuilder.addRow(1L);
    for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH - 1; i++) {
      rowSetBuilder.addRow(new Object[] { null });
    }
    verify(rowSetBuilder.build(), batch1);

    // Second batch: the one remaining row, which has no arrayOfArray value.
    DirectRowSet batch2 = nextRowSet();
    rowSetBuilder = newRowSetBuilder();
    rowSetBuilder.addRow(new Object[] { null });
    verify(rowSetBuilder.build(), batch2);
  }
}
The test passes. Then I change

public static final long DEFAULT_ROWS_PER_BATCH = BaseValueVector.INITIAL_VALUE_ALLOCATION;

to

public static final long DEFAULT_ROWS_PER_BATCH = BaseValueVector.INITIAL_VALUE_ALLOCATION + 1;

and the test case fails.
I can attach the whole trace output if you like.
On Sat, Oct 13, 2018 at 7:44 PM Paul Rogers <[email protected]>
wrote:
> Hi JC,
>
> Your test code looks OK. Looks like you've gotten quite good at using the
> row set classes for testing. If you like, you can simplify your code just a
> bit by using the rowSet() method of the query builder: it skips the first
> empty batch for you. Plus, if your result is only a single row with a
> single long column, you can further simplify by calling singletonLong()
> which will grab just that one value.
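> A quick sketch of what that simplification might look like, reusing the
> fixture names from the test above (rowSet() and singletonLong() are the
> calls described here):
>
>   String sql = "select root.arrayOfArray[0][0] as w from `dfs.data`.`test.json` as root";
>
>   // Skips the initial empty batch and returns the first real batch.
>   DirectRowSet result = client.queryBuilder().sql(sql).rowSet();
>
>   // Or, when the result is a single row with a single BIGINT column,
>   // pull out just that value.
>   long w = client.queryBuilder().sql(sql).singletonLong();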
>
> Here is another thought. Perhaps you are hitting an existing bug somewhere
> in Drill. The Repeated List vector, as noted previously, is under-used and
> may still contain bugs in some operators. Let's see if we can rule out this
> case. I suggest this because, when I was testing an updated version of the
> JSON reader, I did encounter bugs elsewhere in Drill, but I can't recall if
> the problem was with the LIST or REPEATED LIST type...
>
> Try converting your data to JSON, then issue the same query using the JSON
> reader. This will tell you if the bug is in your code (JSON succeeds) or in
> Drill itself (JSON fails). You may have to temporarily force the JSON
> reader to read the same number of records per batch as your reader does.
>
> Thanks,
>
> - Paul
>
>
>
> On Friday, October 12, 2018, 6:22:40 PM PDT, Paul Rogers <
> [email protected]> wrote:
>
> Drill enforces two hard limits:
> 1. The maximum number of rows in a batch is 64K.
> 2. The maximum size of any vector is 4 GB.
>
> We have found, however, that fragmentation occurs in our memory allocator
> for any vector larger than 16 MB. (This is, in fact, the original reason for
> the result set loader stuff I've been rambling on about.)
>
> Your DEFAULT_ROWS_PER_BATCH is now set to 4K * 4 = 16K. This is a fine
> number of rows (depending entirely, of course, on row width).
>
> The problem you are having is that you are trying to index a repeated list
> vector past its end. This very likely means that your code that built the
> vector has bugs.
>
> RepeatedList is tricky: it is an offset vector that wraps a Repeated
> vector. It is important to get all those offsets just right. Remember that,
> in offset vectors, the end offset for a row is written one position past the
> row index. (Row 3 needs its offset written in offset vector position 4.)
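> To make that concrete, here is how the offsets should line up for the single
> row {"arrayOfArray":[[1],[1,2]]} from the JSON test above (positions are
> 0-based; this is an illustration of the rule, not output dumped from Drill):
>
>   outer (RepeatedList) offsets:    [0, 2]     // row 0 ends after 2 inner lists, so position 1 holds 2
>   inner (repeated BIGINT) offsets: [0, 1, 3]  // inner list 0 ends at 1, inner list 1 ends at 3
>   BIGINT values:                   [1, 1, 2]
>
> A missing or off-by-one trailing offset is exactly the kind of bug that lets
> a downstream operator index the vector past its end.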
>
> Here I'll gently suggest the use of the RowSet abstractions which have
> been tested to ensure that they do properly construct each form of vector.
> Let that code do the dirty work of mucking with the various offsets for you.
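> For example, something along these lines should build the expected
> single-row batch (the exact helper names are from memory and may differ in
> your tree: addArray with a dimension count, addSingleCol, and the
> objArray/longArray utilities from RowSetUtilities):
>
>   TupleMetadata schema = new SchemaBuilder()
>       .addArray("arrayOfArray", TypeProtos.MinorType.BIGINT, 2) // 2-D array -> RepeatedList of BIGINT
>       .buildSchema();
>   RowSet expected = new RowSetBuilder(allocator, schema)
>       .addSingleCol(objArray(longArray(1L), longArray(1L, 2L)))  // [[1],[1,2]]
>       .build();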
>
> Alternatively, look at the RowSet (column writers) or ComplexWriter to
> see if you can figure out what those mechanisms are doing that your code is
> missing.
>
> Here's how I'd debug this. Write a test that exercises your reader in
> isolation. That is, exercise the reader outside of any query, just by
> itself. Doing so is a bit tricky given how the scan operator works, but it
> is possible. Check out the external sort unit tests for some examples;
> perhaps other developers can point you to more.
>
> Configure the reader to read a simple file with just a few rows. Create
> files that include each type. (Easier to test if you include a few columns
> in each of several files, rather than one big file with all column types.)
> This will give you a record batch with what was read.
>
> Then, use the RowSet mechanisms to build up an expected record batch, and
> compare the expected batch with your actual batch. This is a much easier
> way to catch your vector structure errors than going through the Project
> operator.
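> In code, that comparison might look roughly like this (verifyAndClearAll on
> RowSetComparison and the fromContainer factory are from memory, and
> scanBatch stands in for however you obtain the batch from your reader):
>
>   RowSet actual = DirectRowSet.fromContainer(scanBatch.getContainer());
>   RowSet expected = new RowSetBuilder(allocator, expectedSchema)
>       .addRow(1L)
>       .build();
>   new RowSetComparison(expected).verifyAndClearAll(actual);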
>
> I hope this helps...
>
> Thanks,
> - Paul
>
>
>
> On Friday, October 12, 2018, 5:31:53 PM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> I've changed my record reader's batch size to be larger. All my test cases
> still work as I would expect, except for one, and I have no idea why. I've
> turned on tracing in the hope of getting a hint. I now see the failure is in
> a generated projection class, but I'm not sure why. Can anyone speculate why
> a change in batch size would cause such a failure?
>
> I've added my record reader change, test case and error from the trace.
> Thanks
> jc
>
> public class MsgpackRecordReader extends AbstractRecordReader {
>   private static final org.slf4j.Logger logger =
>       org.slf4j.LoggerFactory.getLogger(MsgpackRecordReader.class);
>
>   public static final long DEFAULT_ROWS_PER_BATCH = BaseValueVector.INITIAL_VALUE_ALLOCATION * 4;
>
> @Test
> public void testSchemaArrayOfArrayCell() throws Exception {
>   LogFixtureBuilder logBuilder = LogFixture.builder()
>       // Log to the console for debugging convenience
>       .toConsole()
>       .logger("org.apache.drill.exec", Level.TRACE);
>   try (LogFixture logs = logBuilder.build()) {
>     learnModel();
>     String sql = "select root.arrayOfarray[0][0] as w from dfs.data.`secondBatchHasCompleteModel.mp` as root";
>     rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
>
>     schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, TypeProtos.DataMode.OPTIONAL);
>     expectedSchema = schemaBuilder.buildSchema();
>     verifyFirstBatchNull();
>
>     rowSetBuilder = newRowSetBuilder();
>     rowSetBuilder.addRow(1L);
>     verify(rowSetBuilder.build(), nextRowSet());
>   }
> }
>
> java.lang.AssertionError: null
>     at org.apache.drill.exec.vector.complex.RepeatedListVector$DelegateRepeatedVector$RepeatedListAccessor.get(RepeatedListVector.java:73) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.vector.complex.impl.RepeatedListReaderImpl.setPosition(RepeatedListReaderImpl.java:95) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.test.generated.ProjectorGen1.doEval(ProjectorTemplate.java:27) ~[na:na]
>     at org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords(ProjectorTemplate.java:67) ~[na:na]
>     at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:232) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
>     at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]