Wenbo Hu created ARROW-14549:
--------------------------------
Summary: VectorSchemaRoot is not refreshed when value is null
Key: ARROW-14549
URL: https://issues.apache.org/jira/browse/ARROW-14549
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 6.0.0
Reporter: Wenbo Hu
I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
With the following code, unexpected behavior occurs.
Assume a SQLite db in which col_2 and col_3 of the 2nd row are null.
|col_1|col_2|col_3|
|-------|--------|--------|
|1|abc|3.14|
|2|NULL|NULL|
As the documentation suggests,
bq. populated data over and over into the same VectorSchemaRoot in a stream of
batches rather than creating a new VectorSchemaRoot instance each time.
Accordingly, *JdbcToArrowConfig* is set to reuse the root.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config with reuse schema root and custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(true)
        .build();
    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      final VectorSchemaRoot root = iterator.next();
      option.getCallback().handleBatchResult(root);
      root.allocateNew(); // has to be called, otherwise null values are not refreshed
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}
......
// batch_size is set to 1, so the callback is called twice.
QueryOption options = new QueryOption(1,
    root -> {
      // if printer is not set, get schema and write header
      if (printer == null) {
        final String[] headers = root.getSchema().getFields().stream()
            .map(Field::getName).toArray(String[]::new);
        printer = new CSVPrinter(writer,
            CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
      }
      final int rows = root.getRowCount();
      final List<FieldVector> fieldVectors = root.getFieldVectors();
      // iterate over rows
      for (int i = 0; i < rows; i++) {
        final int rowId = i;
        final List<String> row = fieldVectors.stream()
            .map(v -> v.getObject(rowId)).map(String::valueOf)
            .collect(Collectors.toList());
        printer.printRecord(row);
      }
    });
connection.querySql("SELECT * FROM test_db", options);
......
{code}
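For reference, the callback renders each cell with `String::valueOf`, so a null cell is written as the literal text `null` in the CSV. A minimal standalone sketch of just that step follows; the class name `NullCellRendering` and the cell values are hypothetical stand-ins for what `v.getObject(rowId)` returns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullCellRendering {
    // Render one row the same way the callback does: String.valueOf on each cell.
    static List<String> renderRow(List<Object> cells) {
        return cells.stream().map(String::valueOf).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical cell values standing in for v.getObject(rowId).
        List<Object> row = Arrays.asList(2, null, null);
        System.out.println(String.join(",", renderRow(row))); // prints 2,null,null
    }
}
```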
If `root.allocateNew()` is called, the CSV file is as expected:
```
column_1,column_2,column_3
1,abc,3.14
2,null,null
```
Otherwise, the null values in the 2nd row keep the values from the 1st row:
```
column_1,column_2,column_3
1,abc,3.14
2,abc,3.14
```
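The two outputs are consistent with stale state surviving in the reused root. The sketch below is a toy model of that hypothesis, not Arrow's actual vector implementation. It assumes (the report does not confirm this) that the writer stores non-null cells and simply skips null ones, so a slot marked valid by an earlier batch stays valid. `ToyColumn`, `readBatches`, and `ReusedBufferModel` are hypothetical names used only for illustration.

```java
import java.util.Arrays;

public class ReusedBufferModel {
    // A toy nullable column: one value slot plus one validity flag per row.
    // This is a simplified stand-in, not Arrow's actual vector implementation.
    static final class ToyColumn {
        final String[] values;
        final boolean[] valid;

        ToyColumn(int capacity) {
            values = new String[capacity];
            valid = new boolean[capacity];
        }

        // Assumed writer behavior: non-null values are stored and marked valid,
        // null values leave the slot untouched.
        void write(int index, String value) {
            if (value != null) {
                values[index] = value;
                valid[index] = true;
            }
        }

        // Clear the validity flags so every slot reads as null again.
        void reset() {
            Arrays.fill(valid, false);
        }

        String read(int index) {
            return valid[index] ? values[index] : null;
        }
    }

    // Write each input as its own one-row batch into a single reused column
    // and return what a reader would see after each batch.
    static String[] readBatches(String[] inputs, boolean resetBetweenBatches) {
        ToyColumn column = new ToyColumn(1);
        String[] seen = new String[inputs.length];
        for (int batch = 0; batch < inputs.length; batch++) {
            if (resetBetweenBatches) {
                column.reset();
            }
            column.write(0, inputs[batch]);
            seen[batch] = column.read(0);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] col2 = {"abc", null};
        System.out.println(Arrays.toString(readBatches(col2, false))); // [abc, abc]
        System.out.println(Arrays.toString(readBatches(col2, true)));  // [abc, null]
    }
}
```

In this model, resetting between batches plays the role that `root.allocateNew()` plays in the code above: it discards the validity state left over from the previous batch.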
**Question: Is it expected that `allocateNew` must be called every time the
schema root is reused?**
Without reusing the schema root, the following code works as expected.
{code:java}
public void querySql(String query, QueryOption option) throws Exception {
  try (final java.sql.Connection conn = connectContainer.getConnection();
       final Statement stmt = conn.createStatement();
       final ResultSet rs = stmt.executeQuery(query)) {
    // create config without reuse schema root and custom batch size from option
    final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
        .setAllocator(new RootAllocator())
        .setCalendar(JdbcToArrowUtils.getUtcCalendar())
        .setTargetBatchSize(option.getBatchSize())
        .setReuseVectorSchemaRoot(false)
        .build();
    final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
    while (iterator.hasNext()) {
      // retrieve result from iterator
      try (VectorSchemaRoot root = iterator.next()) {
        option.getCallback().handleBatchResult(root);
        root.allocateNew();
      }
    }
  } catch (java.lang.Exception e) {
    throw new Exception(e.getMessage());
  }
}
{code}