I use the rowSetIterator to get the first batch. It should contain one row with the value 1, and the rest of the batch should be nulls. The second batch should contain a single row with a null. The JSON file contains batch_size + 1 rows.
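For reference, here is a minimal, self-contained sketch of that input layout; it mirrors the file-writing loop in the JSON test quoted below. ROWS_PER_BATCH stands in for JSONRecordReader.DEFAULT_ROWS_PER_BATCH, and the 4096 value is an assumption (the thread below refers to BaseValueVector.INITIAL_VALUE_ALLOCATION as "4K"):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of the test input: the first record carries the nested array, the
// remaining ROWS_PER_BATCH records are plain rows, so the projected result
// splits into one full batch (a 1 followed by nulls) and one single-row batch
// (a lone null).
public class TestFileSketch {
  // Assumption: BaseValueVector.INITIAL_VALUE_ALLOCATION is 4096 ("4K" in the thread below).
  static final int ROWS_PER_BATCH = 4096;

  public static void main(String[] args) throws IOException {
    try (FileWriter w = new FileWriter(new File("test.json"))) {
      w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");   // row 1: arrayOfArray[0][0] projects to 1
      for (int i = 0; i < ROWS_PER_BATCH; i++) {
        w.write("{\"anInt\":1}\n");                  // rows 2..ROWS_PER_BATCH + 1: project to null
      }
    }
  }
}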
On Mon, Oct 15, 2018 at 2:32 PM Paul Rogers <[email protected]> wrote:

> Hi JC,
>
> Sure looks like a bug. I'd suggest filing a JIRA ticket for this issue,
> attaching your code and input file.
>
> The strange thing is that you suggest that Project works for smaller batch
> sizes but not for larger ones, yet the code of your tests suggests that you
> expect the query to return a single row (or, are there more rows, but
> you're just checking the first?). If just one row, the potential batch size
> should not matter. Can you clarify?
>
> The two short-term options are to either track down the bug in Project, or
> to use your earlier, smaller batch size.
>
> Thanks,
> - Paul
>
>
> On Sunday, October 14, 2018, 6:14:46 PM PDT, Jean-Claude Cote <
> [email protected]> wrote:
>
> Hey Paul, we think alike ;-) that's exactly what I was doing the past
> couple of days. I was simplifying my test case and trying the same scenario
> with the JSON reader. I was able to reproduce the same issue using the JSON
> reader.
>
> @Test
> public void testArrayOfArrayJson() throws Exception {
>   try (OutputStreamWriter w = new OutputStreamWriter(
>       new FileOutputStream(new File(testDir, "test.json")))) {
>     w.write("{\"arrayOfArray\":[[1],[1,2]]}\n");
>     for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH; i++) {
>       w.write("{\"anInt\":1}\n");
>     }
>   }
>   LogFixtureBuilder logBuilder = LogFixture.builder()
>       // Log to the console for debugging convenience
>       .toConsole().logger("org.apache.drill.exec", Level.TRACE);
>   try (LogFixture logs = logBuilder.build()) {
>     String sql = "select root.arrayOfArray[0][0] as w from `dfs.data`.`test.json` as root";
>     rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
>
>     schemaBuilder = new SchemaBuilder();
>     schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, DataMode.OPTIONAL);
>     expectedSchema = schemaBuilder.buildSchema();
>
>     DirectRowSet batch1 = nextRowSet();
>     rowSetBuilder = newRowSetBuilder();
>     rowSetBuilder.addRow(1L);
>     for (int i = 0; i < JSONRecordReader.DEFAULT_ROWS_PER_BATCH - 1; i++) {
>       rowSetBuilder.addRow(new Object[] { null });
>     }
>     verify(rowSetBuilder.build(), batch1);
>
>     DirectRowSet batch2 = nextRowSet();
>     rowSetBuilder = newRowSetBuilder();
>     rowSetBuilder.addRow(new Object[] { null });
>     verify(rowSetBuilder.build(), batch2);
>   }
> }
>
> The test passes. Then I change
>
> public static final long DEFAULT_ROWS_PER_BATCH =
>     BaseValueVector.INITIAL_VALUE_ALLOCATION;
>
> to be
>
> public static final long DEFAULT_ROWS_PER_BATCH =
>     BaseValueVector.INITIAL_VALUE_ALLOCATION + 1;
>
> and the test case fails.
>
> I can attach the whole trace output if you like.
>
> On Sat, Oct 13, 2018 at 7:44 PM Paul Rogers <[email protected]> wrote:
>
> > Hi JC,
> >
> > Your test code looks OK. Looks like you've gotten quite good at using the
> > row set classes for testing. If you like, you can simplify your code just
> > a bit by using the rowSet() method of the query builder: it skips the
> > first empty batch for you. Plus, if your result is only a single row with
> > a single long column, you can further simplify by calling singletonLong(),
> > which will grab just that one value.
> >
> > Here is another thought. Perhaps you are hitting an existing bug somewhere
> > in Drill. The Repeated List vector, as noted previously, is under-used and
> > may still contain bugs in some operators. Let's see if we can rule out
> > this case.
> > I suggest this because, when I was testing an updated version of the JSON
> > reader, I did encounter bugs elsewhere in Drill, but I can't recall if the
> > problem was with the LIST or REPEATED LIST type...
> >
> > Try converting your data to JSON, then issue the same query using the JSON
> > reader. This will tell you if the bug is in your code (JSON succeeds) or
> > in Drill itself (JSON fails). You may have to temporarily force the JSON
> > reader to read the same number of records per batch as your reader does.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > On Friday, October 12, 2018, 6:22:40 PM PDT, Paul Rogers <
> > [email protected]> wrote:
> >
> > Drill enforces two hard limits:
> > 1. The maximum number of rows in a batch is 64K.
> > 2. The maximum size of any vector is 4 GB.
> >
> > We have found, however, that fragmentation occurs in our memory allocator
> > for any vector larger than 16 MB. (This is, in fact, the original reason
> > for the result set loader stuff I've been rambling on about.)
> >
> > Your DEFAULT_ROWS_PER_BATCH is now set to 4K * 4 = 16K. This is a fine
> > number of rows (depending completely, of course, on row width).
> >
> > The problem you are having is that you are trying to index a repeated list
> > vector past its end. This very likely means that the code that built the
> > vector has bugs.
> >
> > RepeatedList is tricky: it is an offset vector that wraps a Repeated
> > vector. It is important to get all those offsets just right. Remember
> > that, in offset vectors, the entry for a row is written one position past
> > the row index. (Row 3 needs its offset written in offset vector position 4.)
> >
> > Here I'll gently suggest the use of the RowSet abstractions, which have
> > been tested to ensure that they properly construct each form of vector.
> > Let that code do the dirty work of mucking with the various offsets for
> > you.
> >
> > Alternatively, look at the RowSet (column writers) or ComplexWriter to see
> > if you can figure out what those mechanisms are doing that your code is
> > missing.
> >
> > Here's how I'd debug this. Write a test that exercises your reader in
> > isolation. That is, exercise the reader outside of any query, just by
> > itself. Doing so is a bit tricky given how the scan operator works, but it
> > is possible. Check out the external sort unit tests for some examples;
> > perhaps other developers can point you to others.
> >
> > Configure the reader to read a simple file with just a few rows. Create
> > files that include each type. (It is easier to test if you include a few
> > columns in each of several files, rather than one big file with all column
> > types.) This will give you a record batch with what was read.
> >
> > Then, use the RowSet mechanisms to build up an expected record batch, and
> > compare the expected value with your actual value. This is a much easier
> > mechanism than using the Project operator to catch your vector structure
> > errors.
> >
> > I hope this helps...
> >
> > Thanks,
> > - Paul
> >
> >
> > On Friday, October 12, 2018, 5:31:53 PM PDT, Jean-Claude Cote <
> > [email protected]> wrote:
> >
> > I've changed my record reader's batch size to be larger. All my test cases
> > still work as I would expect, except for one, and I have no idea why. I've
> > turned on tracing in the hope of getting a hint. I now see the failure is
> > in a generated projection class, but I'm not sure why. Can anyone
> > speculate why a change in batch size would cause such a failure?
> >
> > I've added my record reader change, test case and error from the trace.
> > Thanks
> > jc
> >
> > public class MsgpackRecordReader extends AbstractRecordReader {
> >   private static final org.slf4j.Logger logger =
> >       org.slf4j.LoggerFactory.getLogger(MsgpackRecordReader.class);
> >
> >   public static final long DEFAULT_ROWS_PER_BATCH =
> >       BaseValueVector.INITIAL_VALUE_ALLOCATION * 4;
> >
> > @Test
> > public void testSchemaArrayOfArrayCell() throws Exception {
> >   LogFixtureBuilder logBuilder = LogFixture.builder()
> >       // Log to the console for debugging convenience
> >       .toConsole().logger("org.apache.drill.exec", Level.TRACE);
> >   try (LogFixture logs = logBuilder.build()) {
> >     learnModel();
> >     String sql = "select root.arrayOfarray[0][0] as w from dfs.data.`secondBatchHasCompleteModel.mp` as root";
> >     rowSetIterator = client.queryBuilder().sql(sql).rowSetIterator();
> >
> >     schemaBuilder.add("w", TypeProtos.MinorType.BIGINT, TypeProtos.DataMode.OPTIONAL);
> >     expectedSchema = schemaBuilder.buildSchema();
> >     verifyFirstBatchNull();
> >
> >     rowSetBuilder = newRowSetBuilder();
> >     rowSetBuilder.addRow(1L);
> >     verify(rowSetBuilder.build(), nextRowSet());
> >   }
> > }
> >
> > java.lang.AssertionError: null
> >   at org.apache.drill.exec.vector.complex.RepeatedListVector$DelegateRepeatedVector$RepeatedListAccessor.get(RepeatedListVector.java:73) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.vector.complex.impl.RepeatedListReaderImpl.setPosition(RepeatedListReaderImpl.java:95) ~[vector-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.test.generated.ProjectorGen1.doEval(ProjectorTemplate.java:27) ~[na:na]
> >   at org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords(ProjectorTemplate.java:67) ~[na:na]
> >   at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:232) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
> >   at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.15.0-SNAPSHOT.jar:1.15.0-SNAPSHOT]
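Since the offset rule Paul describes above is easy to get wrong, here is a minimal, self-contained sketch in plain Java (not Drill's vector API; the class and variable names are made up for illustration) of how the nested value [[1],[1,2]] maps onto the inner and outer offset vectors of a repeated list, following the rule that the entry closing row r lives at position r + 1:

import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of repeated-list offsets: an offset vector always has
// rowCount + 1 entries, and the entry at position r + 1 marks where row r ends.
public class OffsetSketch {
  public static void main(String[] args) {
    // One row whose single column holds the nested value [[1], [1, 2]].
    List<List<Long>> arrayOfArray = Arrays.asList(
        Arrays.asList(1L),
        Arrays.asList(1L, 2L));

    // Inner level: flatten the leaf values and record where each inner list ends.
    int[] innerOffsets = new int[arrayOfArray.size() + 1];
    int valueCount = 0;
    for (int i = 0; i < arrayOfArray.size(); i++) {
      valueCount += arrayOfArray.get(i).size();
      innerOffsets[i + 1] = valueCount;            // list i's closing offset goes at i + 1
    }

    // Outer level: one row, so two entries; entry 1 says the row spans 2 inner lists.
    int[] outerOffsets = { 0, arrayOfArray.size() };

    System.out.println("values:        [1, 1, 2]");
    System.out.println("inner offsets: " + Arrays.toString(innerOffsets));  // [0, 1, 3]
    System.out.println("outer offsets: " + Arrays.toString(outerOffsets));  // [0, 2]
  }
}

A batch builder that omits the trailing entry, or writes a row's closing offset at position r instead of r + 1, can produce the kind of past-the-end index that Paul describes above.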
