[GitHub] [arrow] janosik47 opened a new issue, #37056: [Java] [c-data] [dataset] Exception when importing a vector with empty data array from c-data

via GitHub Tue, 08 Aug 2023 00:58:21 -0700


janosik47 opened a new issue, #37056:
URL: https://github.com/apache/arrow/issues/37056


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Importing arrays into Java through the C data interface throws an exception 
when constructing a vector whose data buffer is empty.
   Example: importing an empty list of Int32 primitives throws the following 
exception:
   
    
   ```
   Exception in thread "main" java.lang.IllegalArgumentException: Could not load buffers for field $data$: Int(32, true). error message: Buffer 1 for type Int(32, true) cannot be null
        at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:131)
        at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
        at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
        at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
        at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
        at org.apache.arrow.c.ArrayImporter.importArray(ArrayImporter.java:71)
        at org.apache.arrow.c.Data.importIntoVector(Data.java:289)
        at org.apache.arrow.c.Data.importIntoVectorSchemaRoot(Data.java:332)
        at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:151)
   ```
   
   ## How to reproduce
   ### Create a sample document (Jupyter notebook):
   python: 3.10.4
   pyarrow: 12.0.1
   pandas: 2.0.3
   ```python
   import pyarrow as pa
   import pyarrow.feather as pf
   import pandas as pd
   
   schema = pa.schema([
       pa.field("a", pa.list_(pa.int32()), True),
   ])
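   # One row whose only value is an empty list, so the child Int32 array is empty.
   df = pd.DataFrame(columns=["a"], index=range(1))
   df.iloc[0] = [[]]
   # Pass the schema by keyword: the second positional parameter of pa.table() is `names`.
   table = pa.table(df, schema=schema)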
   pf.write_feather(table, "/tmp/sample.feather", compression="uncompressed")
   ```
   
   ### Access the document via the Java Dataset API (Kotlin):
   JVM: openjdk/20.0.1
   arrow: 12.0.1
   kotlin: 1.9.0
   ```kotlin
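   import org.apache.arrow.dataset.file.FileFormat
   import org.apache.arrow.dataset.file.FileSystemDatasetFactory
   import org.apache.arrow.dataset.jni.NativeMemoryPool
   import org.apache.arrow.dataset.scanner.ScanOptions
   import org.apache.arrow.memory.RootAllocator

   fun main() {
       val path = "file:/tmp/sample.feather"
       val allocator = RootAllocator()
       val nativeMemoryPool = NativeMemoryPool.getDefault()

       val factory = FileSystemDatasetFactory(allocator, nativeMemoryPool, FileFormat.ARROW_IPC, path)
       factory.finish().use { dataset ->
           dataset.newScan(ScanOptions(10L)).use { scanner ->
               scanner.scanBatches().use { reader ->
                   println("$path: schema=${reader.vectorSchemaRoot.schema.toJson()}")

                   // loadNextBatch() is the call that throws the IllegalArgumentException
                   while (reader.loadNextBatch()) {
                       println(reader.vectorSchemaRoot.contentToTSVString())
                   }
               }
           }
       }
   }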
   ```
   
   ### Comment
   A quick look at the Java code shows that the [ArrayImporter](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/ArrayImporter.java#L126) class uses an instance of [BufferImportTypeVisitor](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java), which imports the vector's buffers based on the field's data type.
   
   In this case the [visit(ArrowType.Int type)](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L181-L183) method is called. It accepts a null validity bitmap buffer ([here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L124-L132)) but requires a non-null data buffer ([here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L93-L99), and then [here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L83-L91)).
   
   My understanding is that, from the vector's perspective, the data buffer must not be null, which is why the visitor enforces it. However, according to the C data ArrowArray [spec](https://arrow.apache.org/docs/dev/format/CDataInterface.html#the-arrowarray-structure), an array can hold null buffers:
   > The buffer pointers MAY be null only in two situations:
   >
   > 1. for the null bitmap buffer, if ArrowArray.null_count is 0;
   > 2. for any buffer, if the size in bytes of the corresponding buffer would be 0.
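
Read as a predicate, the two situations quoted above look roughly like the sketch below. It is only an illustration of the spec wording: `may_be_null` and its parameters are hypothetical names, not part of any Arrow API.

```python
def may_be_null(is_validity_bitmap: bool, null_count: int, size_in_bytes: int) -> bool:
    """Return True if the quoted rule allows this buffer pointer to be null."""
    # Situation 1: the null bitmap may be omitted when the array has no nulls.
    if is_validity_bitmap and null_count == 0:
        return True
    # Situation 2: any buffer may be null when it would hold zero bytes.
    return size_in_bytes == 0
```

For the empty `$data$` child in this report, the data buffer would hold zero bytes, so `may_be_null(False, 0, 0)` is true and a null pointer is permitted.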
   
   Based on the above, it seems to me that BufferImportTypeVisitor could create an empty data buffer when the corresponding C data buffer is null and the field is empty (fieldNode.length == 0), and throw when the field is not empty.
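
The suggested behaviour can be modelled in a few lines. This is a language-neutral sketch in Python, not the actual Java change: `import_data_buffer` is a hypothetical name, `None` stands in for a null C pointer, and `field_length` stands in for `fieldNode.length`.

```python
def import_data_buffer(pointer, field_length):
    """Model of the proposed handling of a fixed-width data buffer pointer."""
    if pointer is not None:
        # Normal case: the native buffer exists and is used as-is.
        return pointer
    if field_length == 0:
        # Null pointer for an empty field: substitute an empty buffer.
        return b""
    # Null pointer for a non-empty field: the input is invalid.
    raise ValueError(
        f"data buffer is null but the field has {field_length} value(s)"
    )
```

With this rule, the empty Int32 child from the reproduction would import as an empty buffer, while a null data pointer on a non-empty field would still be rejected.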
   
   ### Component(s)
   
   Java


