janosik47 opened a new issue, #37056:
URL: https://github.com/apache/arrow/issues/37056
### Describe the bug, including details regarding any error messages, version, and platform.
Importing C Data Interface arrays into Java throws an exception when the
importer tries to construct a vector whose data buffer is empty.
Example: importing an empty list of Int32 primitives throws the following:
```
Exception in thread "main" java.lang.IllegalArgumentException: Could not load buffers for field $data$: Int(32, true). error message: Buffer 1 for type Int(32, true) cannot be null
    at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:131)
    at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
    at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
    at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
    at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
    at org.apache.arrow.c.ArrayImporter.importArray(ArrayImporter.java:71)
    at org.apache.arrow.c.Data.importIntoVector(Data.java:289)
    at org.apache.arrow.c.Data.importIntoVectorSchemaRoot(Data.java:332)
    at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:151)
```
## How to reproduce
### Create a sample file (Jupyter notebook):
python: 3.10.4
pyarrow: 12.0.1
pandas: 2.0.3
```python
import pyarrow as pa
import pyarrow.feather as pf
import pandas as pd
schema = pa.schema([
    pa.field("a", pa.list_(pa.int32()), True),
])
df = pd.DataFrame(columns=["a"],index=range(1))
df.iloc[0] = [[]]
table = pa.table(df, schema)
pf.write_feather(table, "/tmp/sample.feather", compression="uncompressed")
```
### Read the file via the Java Dataset API (Kotlin):
JVM: openjdk/20.0.1
arrow: 12.0.1
kotlin: 1.9.0
```kotlin
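import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

// URI of the Feather file written by the Python snippet above.
val path = "file:/tmp/sample.feather"
val allocator = RootAllocator()
val nativeMemoryPool = NativeMemoryPool.getDefault()
val factory = FileSystemDatasetFactory(allocator, nativeMemoryPool,
    FileFormat.ARROW_IPC, path)
factory.finish().use { dataset ->
    dataset.newScan(ScanOptions(10L)).use { scanner ->
        scanner.scanBatches().use { reader ->
            println("$path: schema=${reader.vectorSchemaRoot.schema.toJson()}")
            // loadNextBatch() is where the IllegalArgumentException is thrown.
            while (reader.loadNextBatch()) {
                println(reader.vectorSchemaRoot.contentToTSVString())
            }
        }
    }
}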
```
### Comment
A quick look at the Java code shows that the
[ArrayImporter](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/ArrayImporter.java#L126)
class uses an instance of
[BufferImportTypeVisitor](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java),
which imports the vector's buffers based on the field's data type.
In this case the
[visit(ArrowType.Int type)](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L181-L183)
method is called. It accepts a null validity bitmap buffer
([here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L124-L132))
but requires a non-null data buffer
([here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L93-L99)
and then
[here](https://github.com/apache/arrow/blob/main/java/c/src/main/java/org/apache/arrow/c/BufferImportTypeVisitor.java#L83-L91)).
My understanding is that, from the Vector perspective, the data buffer must
not be null, so the visitor enforces that. However, according to the C data
ArrowArray
[spec](https://arrow.apache.org/docs/dev/format/CDataInterface.html#the-arrowarray-structure),
buffer pointers can be null:
> The buffer pointers MAY be null only in two situations:
>
> 1. for the null bitmap buffer, if ArrowArray.null_count is 0;
> 2. for any buffer, if the size in bytes of the corresponding buffer would be 0.

Based on the above, it seems to me that BufferImportTypeVisitor could create
an empty data buffer when the corresponding C data buffer is null and the
field is empty (fieldNode.length == 0), and throw if the field is not empty.
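
To make the suggested rule concrete, here is a small language-neutral model of it, written in Python to match the reproduction script. It is only a sketch: `resolve_data_buffer` and its parameters are hypothetical names, the buffer is modeled as `bytes` (with `None` standing in for a NULL pointer), and an actual fix would presumably live in the Java BufferImportTypeVisitor rather than look like this.

```python
from typing import Optional

INT32_WIDTH = 4  # bytes per Int32 value


def resolve_data_buffer(raw: Optional[bytes], length: int,
                        type_width: int = INT32_WIDTH) -> bytes:
    """Hypothetical model of the proposed import rule for a fixed-width data buffer.

    raw        -- bytes behind the C buffer pointer, or None for a NULL pointer
    length     -- number of values in the field (fieldNode.length in the issue)
    type_width -- size of one value in bytes
    """
    if raw is not None:
        # Non-NULL pointer: nothing special to do, import as today.
        return raw
    if length * type_width == 0:
        # Spec case 2: a NULL pointer is allowed when the buffer would hold 0 bytes,
        # so substitute an empty buffer instead of rejecting the array.
        return b""
    # NULL pointer for a non-empty field is still an error.
    raise ValueError(
        f"data buffer is NULL but {length * type_width} bytes are required"
    )


# The empty Int32 child of [[]] has length 0, so a NULL data pointer is accepted.
empty = resolve_data_buffer(None, length=0)
print(len(empty))
```

With this rule the empty child vector from the reproduction would import as a zero-length buffer, while a NULL data pointer on a non-empty field would still be rejected.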
### Component(s)
Java