[
https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Cutler resolved ARROW-14549.
----------------------------------
Resolution: Not A Problem
> VectorSchemaRoot is not refreshed when value is null
> ----------------------------------------------------
>
> Key: ARROW-14549
> URL: https://issues.apache.org/jira/browse/ARROW-14549
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 6.0.0
> Reporter: Wenbo Hu
> Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow.
> With the following code, unexpected behavior occurs.
> Assume a SQLite database in which col_2 and col_3 of the 2nd row are null.
> |col_1|col_2|col_3|
> |1|abc|3.14|
> |2|NULL|NULL|
> As the documentation suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a
> stream of batches rather than creating a new VectorSchemaRoot instance each
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse the root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // create a config that reuses the schema root, with the batch size taken from option
>     final JdbcToArrowConfig config =
>         new JdbcToArrowConfigBuilder()
>             .setAllocator(new RootAllocator())
>             .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>             .setTargetBatchSize(option.getBatchSize())
>             .setReuseVectorSchemaRoot(true)
>             .build();
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) { // retrieve results from the iterator
>       final VectorSchemaRoot root = iterator.next();
>       option.getCallback().handleBatchResult(root);
>       root.allocateNew(); // allocateNew() has to be called here
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
>
> ......
> // batch_size is set to 1, so the callback is called twice.
> QueryOption options = new QueryOption(1,
>     root -> {
>       // if the printer is not set yet, get the schema and write the header
>       if (printer == null) {
>         final String[] headers =
>             root.getSchema().getFields().stream().map(Field::getName).toArray(String[]::new);
>         printer = new CSVPrinter(writer,
>             CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
>       }
>
>       final int rows = root.getRowCount();
>       final List<FieldVector> fieldVectors = root.getFieldVectors();
>
>       // iterate over rows
>       for (int i = 0; i < rows; i++) {
>         final int rowId = i;
>         final List<String> row = fieldVectors.stream()
>             .map(v -> v.getObject(rowId))
>             .map(String::valueOf)
>             .collect(Collectors.toList());
>         printer.printRecord(row);
>       }
>     });
>
> connection.querySql("SELECT * FROM test_db", options);
> ......
> {code}
> If `root.allocateNew()` is called, the CSV file is as expected:
> ```
> column_1,column_2,column_3
> 1,abc,3.14
> 2,null,null
> ```
> Otherwise, the null values in the 2nd row keep the values from the 1st row:
> ```
> column_1,column_2,column_3
> 1,abc,3.14
> 2,abc,3.14
> ```
> **Question: Is it expected that `allocateNew` must be called every time the
> schema root is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // create a config that does not reuse the schema root, with the batch size taken from option
>     final JdbcToArrowConfig config =
>         new JdbcToArrowConfigBuilder()
>             .setAllocator(new RootAllocator())
>             .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>             .setTargetBatchSize(option.getBatchSize())
>             .setReuseVectorSchemaRoot(false)
>             .build();
>
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) {
>       // retrieve results from the iterator
>       try (VectorSchemaRoot root = iterator.next()) {
>         option.getCallback().handleBatchResult(root);
>         root.allocateNew();
>       }
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
> {code}
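
The symptom described above (nulls in the second batch showing the first batch's values unless the root is re-allocated) can be reproduced with a toy model of a reused nullable column. This is only a sketch under assumptions, not Arrow code. `ToyVector`, `skipNull`, `reset`, and `run` are hypothetical names. The sketch assumes a writer that does nothing for null values and relies on the validity bits already being clear. Whether the JDBC adapter in 6.0.0 behaves this way internally is not confirmed here.

```java
import java.util.BitSet;

/** Toy model of a reused nullable column (hypothetical; not Arrow's classes). */
final class ReusedVectorSketch {

    /** A value buffer plus a validity bitmap, as in a nullable columnar vector. */
    static final class ToyVector {
        private final String[] values;
        private final BitSet validity = new BitSet();

        ToyVector(int capacity) {
            values = new String[capacity];
        }

        /** Writes a non-null value and marks the slot as valid. */
        void set(int index, String value) {
            values[index] = value;
            validity.set(index);
        }

        /** Assumed writer behavior for nulls: touch nothing, expect a clear bit. */
        void skipNull(int index) {
            // intentionally empty
        }

        /** Returns the stored value, or null when the validity bit is clear. */
        String get(int index) {
            return validity.get(index) ? values[index] : null;
        }

        /** Clears all validity bits so stale values are no longer visible. */
        void reset() {
            validity.clear();
        }
    }

    /** Feeds "abc" then null through one reused single-row vector, one row per batch. */
    static String[] run(boolean resetBetweenBatches) {
        ToyVector col2 = new ToyVector(1);
        String[] source = {"abc", null};
        String[] seen = new String[source.length];
        for (int batch = 0; batch < source.length; batch++) {
            if (source[batch] == null) {
                col2.skipNull(0);
            } else {
                col2.set(0, source[batch]);
            }
            seen[batch] = String.valueOf(col2.get(0));
            if (resetBetweenBatches) {
                col2.reset();
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", run(false))); // prints abc,abc
        System.out.println(String.join(",", run(true)));  // prints abc,null
    }
}
```

In this model, the stale value is visible only because the validity bit from the previous batch is never cleared. Resetting between batches plays the role that `root.allocateNew()` plays in the reporter's loop.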