[
https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Philipp Moritz updated ARROW-1692:
----------------------------------
Description:
I'm currently working on making pyarrow.serialization data available from the
Java side. One problem I ran into is that the Java implementation apparently
cannot read UnionArrays generated from C++. To make this easy to reproduce, I
created a clean Python implementation for creating UnionArrays:
https://github.com/apache/arrow/pull/1216
The data is generated with the following script:
{code}
import pyarrow as pa
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
batch = pa.RecordBatch.from_arrays([result], ["test"])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
sink.close()
b = sink.get_result()
with open("union_array.arrow", "wb") as f:
    f.write(b)
# Sanity check: read the batch back in
with open("union_array.arrow", "rb") as f:
    b = f.read()
reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
batch = reader.read_next_batch()
print("union array is", batch.column(0))
{code}
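For reference, the logical contents this dense union should round-trip can be reconstructed by hand from `types` and `value_offsets` (a plain-Python sketch, independent of pyarrow — slot i takes element `value_offsets[i]` from child `types[i]`):

{code}
# Children of the dense union, as plain Python values.
children = [
    [b'a', b'b', b'c', b'd'],  # type id 0: binary
    [1, 2, 3],                 # type id 1: int64
]
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

# Each slot i picks children[types[i]][value_offsets[i]].
values = [children[t][o] for t, o in zip(types, value_offsets)]
print(values)  # [b'a', 1, b'c', b'b', 2, 3, b'd']
{code}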
I attached the file generated by that script. Then when I run the following
code in Java:
{code}
RootAllocator allocator = new RootAllocator(1000000000);
ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader.loadNextBatch();
{code}
I get the following error:
{code}
| java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
| at VectorLoader.loadBuffers (VectorLoader.java:83)
| at VectorLoader.load (VectorLoader.java:62)
| at ArrowReader$1.visit (ArrowReader.java:125)
| at ArrowReader$1.visit (ArrowReader.java:111)
| at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
| at ArrowReader.loadNextBatch (ArrowReader.java:137)
| at (#7:1)
{code}
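The size mismatch in the message is consistent with the Java reader treating the union as sparse: in a sparse union every child must have the same length as the parent (7 here), whereas the dense children written from Python only hold the values of their own type (4 and 3). A small sketch of both layout invariants, using the lengths from the script above:

{code}
types = [0, 1, 0, 0, 1, 1, 0]
child_lengths = {0: 4, 1: 3}  # len(binary) and len(int64) as written from Python

parent_length = len(types)  # 7

# Sparse layout: every child must be as long as the parent array.
sparse_ok = all(n == parent_length for n in child_lengths.values())
# Dense layout: each child only holds the values of its own type id.
dense_ok = all(n == types.count(t) for t, n in child_lengths.items())

print(sparse_ok, dense_ok)  # False True
{code}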
It seems Java is not picking up that the UnionArray is Dense rather than
Sparse. After changing the default in
java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I
get this:
{code}
jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>
{code}
but then reading doesn't work:
{code}
jshell> reader.loadNextBatch()
| java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
| at VectorLoader.loadBuffers (VectorLoader.java:83)
| at VectorLoader.load (VectorLoader.java:62)
| at ArrowReader$1.visit (ArrowReader.java:125)
| at ArrowReader$1.visit (ArrowReader.java:111)
| at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
| at ArrowReader.loadNextBatch (ArrowReader.java:137)
| at (#8:1)
{code}
Any help with this is appreciated!
> [Python, Java] UnionArray round trip not working
> ------------------------------------------------
>
> Key: ARROW-1692
> URL: https://issues.apache.org/jira/browse/ARROW-1692
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Philipp Moritz
> Attachments: union_array.arrow
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)