[jira] [Resolved] (ARROW-14549) VectorSchemaRoot is not refreshed when value is null

Bryan Cutler (Jira) Fri, 25 Feb 2022 10:23:04 -0800


     [ https://issues.apache.org/jira/browse/ARROW-14549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Bryan Cutler resolved ARROW-14549.
----------------------------------
    Resolution: Not A Problem

> VectorSchemaRoot is not refreshed when value is null
> ----------------------------------------------------
>
>                 Key: ARROW-14549
>                 URL: https://issues.apache.org/jira/browse/ARROW-14549
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 6.0.0
>            Reporter: Wenbo Hu
>            Priority: Major
>
> I'm using `arrow-jdbc` to convert query results from JDBC to Arrow,
>  but the following code shows unexpected behavior.
> Assume a SQLite database in which col_2 and col_3 of the 2nd row are null.
> |col_1|col_2|col_3|
> |1|abc|3.14|
> |2|NULL|NULL|
> As the documentation suggests,
> {quote}populated data over and over into the same VectorSchemaRoot in a 
> stream of batches rather than creating a new VectorSchemaRoot instance each 
> time.
> {quote}
> *JdbcToArrowConfig* is set to reuse the root.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // Create a config that reuses the schema root, with the batch size taken from option.
>     final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
>         .setAllocator(new RootAllocator())
>         .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>         .setTargetBatchSize(option.getBatchSize())
>         .setReuseVectorSchemaRoot(true)
>         .build();
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) {
>       // Retrieve the next batch from the iterator.
>       final VectorSchemaRoot root = iterator.next();
>       option.getCallback().handleBatchResult(root);
>       root.allocateNew(); // allocateNew() has to be called here
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
>  
>  ......
>  // batch_size is set to 1, so the callback is called twice.
>  QueryOption options = new QueryOption(1,
>      root -> {
>        // If the printer is not set yet, get the schema and write the header.
>        if (printer == null) {
>          final String[] headers = root.getSchema().getFields().stream()
>              .map(Field::getName)
>              .toArray(String[]::new);
>          printer = new CSVPrinter(writer,
>              CSVFormat.Builder.create(CSVFormat.DEFAULT).setHeader(headers).build());
>        }
>
>        final int rows = root.getRowCount();
>        final List<FieldVector> fieldVectors = root.getFieldVectors();
>
>        // Iterate over rows.
>        for (int i = 0; i < rows; i++) {
>          final int rowId = i;
>          final List<String> row = fieldVectors.stream()
>              .map(v -> v.getObject(rowId))
>              .map(String::valueOf)
>              .collect(Collectors.toList());
>          printer.printRecord(row);
>        }
>      });
>
>  connection.querySql("SELECT * FROM test_db", options);
>  ......
> {code}
> If `root.allocateNew()` is called, the CSV file is as expected:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,null,null
>  ```
>  Otherwise, the null values in the 2nd row keep the values of the 1st row:
>  ```
>  column_1,column_2,column_3
>  1,abc,3.14
>  2,abc,3.14
>  ```
> **Question: Is it expected that `allocateNew` must be called every time the
> schema root is reused?**
> Without reusing the schema root, the following code works as expected.
> {code:java}
> public void querySql(String query, QueryOption option) throws Exception {
>   try (final java.sql.Connection conn = connectContainer.getConnection();
>        final Statement stmt = conn.createStatement();
>        final ResultSet rs = stmt.executeQuery(query)) {
>     // create config without reuse schema root and custom batch size from option
>     final JdbcToArrowConfig config = new JdbcToArrowConfigBuilder()
>         .setAllocator(new RootAllocator())
>         .setCalendar(JdbcToArrowUtils.getUtcCalendar())
>         .setTargetBatchSize(option.getBatchSize())
>         .setReuseVectorSchemaRoot(false)
>         .build();
>
>     final ArrowVectorIterator iterator = JdbcToArrow.sqlToArrowVectorIterator(rs, config);
>     while (iterator.hasNext()) {
>       // retrieve result from iterator
>       try (VectorSchemaRoot root = iterator.next()) {
>         option.getCallback().handleBatchResult(root);
>         root.allocateNew();
>       }
>     }
>   } catch (java.lang.Exception e) {
>     throw new Exception(e.getMessage());
>   }
> }
> {code}
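
The symptom in the report (a null in the 2nd batch showing the 1st batch's value) can be pictured with a small toy model. This is only a sketch. It assumes a writer that skips null inputs without clearing the slot's validity flag, and it does not claim to reflect how arrow-jdbc is actually implemented. `ReuseDemo`, `ToyNullableColumn`, and their methods are hypothetical names invented for illustration and are not part of Arrow's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReuseDemo {

    // Toy stand-in for a nullable column: one value slot and one validity flag per row.
    // Hypothetical; this is NOT Arrow's vector implementation.
    static final class ToyNullableColumn {
        private final String[] values;
        private final boolean[] valid;

        ToyNullableColumn(int capacity) {
            values = new String[capacity];
            valid = new boolean[capacity];
        }

        // Assumed writer behavior: a null input leaves the slot untouched.
        void write(int index, String value) {
            if (value != null) {
                values[index] = value;
                valid[index] = true;
            }
        }

        // Returns null only when the validity flag for the slot is unset.
        String read(int index) {
            return valid[index] ? values[index] : null;
        }

        // Clears every validity flag; plays the role of an explicit reset between batches.
        void reset() {
            Arrays.fill(valid, false);
        }
    }

    // Streams rows through one reused single-row column (batch size 1),
    // optionally resetting the column after each batch has been handled.
    static List<String> stream(String[] rows, boolean resetBetweenBatches) {
        ToyNullableColumn column = new ToyNullableColumn(1);
        List<String> seen = new ArrayList<>();
        for (String row : rows) {
            column.write(0, row);
            seen.add(String.valueOf(column.read(0)));
            if (resetBetweenBatches) {
                column.reset();
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] col2 = {"abc", null};
        System.out.println(stream(col2, false)); // [abc, abc]
        System.out.println(stream(col2, true));  // [abc, null]
    }
}
```

Under that assumption, the run without a reset mirrors the `2,abc,3.14` output, and the run with a reset mirrors the `2,null,null` output seen when `root.allocateNew()` is called after each batch.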



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
