[ 
https://issues.apache.org/jira/browse/ARROW-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982102#comment-16982102
 ] 

Liya Fan commented on ARROW-7254:
---------------------------------

[~lidavidm] Thanks a lot for the further clarification. 

After investigating the code, I think you have really found a bug in the Java 
code base!

The problem is that the write index of the data buffer is not actually set by 
the setXXX methods. This is not a problem with common operations, because they 
use the offset buffer instead of the write index to obtain the data positions.

For flighting, this is a problem, as the unloaded buffers use write index to 
determine how much data to write. Since the write index is not set, no data is 
actually written in the data buffer, and this makes the file layout invalid.

> BaseVariableWidthVector#setSafe appears to make value offsets inconsistent
> --------------------------------------------------------------------------
>
>                 Key: ARROW-7254
>                 URL: https://issues.apache.org/jira/browse/ARROW-7254
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 0.15.1
>            Reporter: David Li
>            Priority: Minor
>
> The following program writes a file which PyArrow either segfaults (0.14.1) 
> or rejects with an error (0.15.1) {{pyarrow.lib.ArrowInvalid: Column 0: 
> Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4}} 
> on reading.
> Calling {{setRowCount}} again, or calling {{setSafe}} with a higher index 
> fixes it. While it seems from the new documentation that we should (must?) 
> call {{VectorSchemaRoot#setRowCount}} at the end, I wouldn't have expected to 
> get an invalid file by calling using {{setSafe}}, either. 
> Full traceback:
> {noformat}
> > python3 -c 'import pyarrow as pa; 
> > print(pa.ipc.open_stream(open("./test.bin", "rb")).read_pandas())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File 
> "/Users/lidavidm/Flight/arrow-5137-auth/java/venv/lib/python3.7/site-packages/pyarrow/ipc.py",
>  line 46, in read_pandas
>     table = self.read_all()
>   File "pyarrow/ipc.pxi", line 330, in 
> pyarrow.lib._CRecordBatchReader.read_all
>   File "pyarrow/public-api.pxi", line 321, in pyarrow.lib.pyarrow_wrap_table
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 
> inconsistent value_offsets for null slot0!=4
> {noformat}
>  
> Full program:
> {code:java}
> import java.io.OutputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.Collections;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowStreamWriter;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> public class AsdfTest {
>   public static void main(String[] args) throws Exception {
>     Schema schema = new Schema(Collections.singletonList(Field.nullable("a", 
> new ArrowType.Utf8())));
>     try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
>       root.setRowCount(2);
>       VarCharVector v = (VarCharVector) root.getVector("a");
>       v.setSafe(0, "asdf".getBytes(StandardCharsets.UTF_8));
>       try (OutputStream output = 
> Files.newOutputStream(Paths.get("./test.bin"))) {
>         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, output);
>         writer.writeBatch();
>         writer.close();
>       }
>     }
>   }
> }
> {code}
> {{v.setNull(1)}} after {{v.setSafe(0, "asdf")}} does not fix it. Using 
> {{set}} instead of {{setSafe}} will fail in Java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to