[ 
https://issues.apache.org/jira/browse/ARROW-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982057#comment-16982057
 ] 

David Li commented on ARROW-7254:
---------------------------------

Thank you [~fan_li_ya] for the explanation. I think what I am confused by is 
that if you call {{setRowCount}} and do not manipulate the vector, then it 
writes valid data, but calling {{setSafe}} invalidates the vector's _offset_ 
buffer, as the PyArrow error message notes. (Additionally, calling 
{{setSafe(0); setNull(1)}} still leads to this error - it seems {{setNull}} 
does not preserve the same invariants as {{setSafe}}.) But if the expectation 
is that {{setRowCount}} must always be the last operation, then I think we can 
close this.
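
To make the invariant concrete, here is a minimal, Arrow-free sketch of the offset layout for a variable-width column. It is a toy model under stated assumptions, not Arrow's implementation: the class and method names ({{OffsetInvariantSketch}}, {{writeValue}}, {{fillTrailing}}, {{firstViolation}}) are hypothetical, and the "fix" step only approximates what the report suggests a second {{setRowCount}} call does.

```java
import java.util.Arrays;

// Toy model only: none of these names exist in Apache Arrow.
public class OffsetInvariantSketch {

    // Writing slot `index` touches only offsets[index + 1]; later slots keep
    // whatever they held before (zero for a freshly allocated buffer).
    public static void writeValue(int[] offsets, boolean[] valid, int index, int byteLength) {
        valid[index] = true;
        offsets[index + 1] = offsets[index] + byteLength;
    }

    // Assumed repair step: carry the last written end offset forward so that
    // every untouched (null) slot has zero length.
    public static void fillTrailing(int[] offsets, int lastWritten) {
        for (int i = lastWritten + 1; i < offsets.length - 1; i++) {
            offsets[i + 1] = offsets[i];
        }
    }

    // Returns the first offset index that breaks the invariant, or -1 if none.
    // Invariant: offsets never decrease, and a null slot has zero length.
    public static int firstViolation(int[] offsets, boolean[] valid) {
        for (int i = 0; i < valid.length; i++) {
            boolean decreasing = offsets[i + 1] < offsets[i];
            boolean nonEmptyNull = !valid[i] && offsets[i + 1] != offsets[i];
            if (decreasing || nonEmptyNull) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int rowCount = 2;
        int[] offsets = new int[rowCount + 1];   // rowCount + 1 entries, all zero
        boolean[] valid = new boolean[rowCount]; // all slots start as null

        writeValue(offsets, valid, 0, "asdf".length()); // mimics setSafe(0, "asdf")
        System.out.println(Arrays.toString(offsets) + " violation at " + firstViolation(offsets, valid));
        // prints: [0, 4, 0] violation at 2

        fillTrailing(offsets, 0);
        System.out.println(Arrays.toString(offsets) + " violation at " + firstViolation(offsets, valid));
        // prints: [0, 4, 4] violation at -1
    }
}
```

In this model the failing index and the {{0!=4}} pair line up with the PyArrow message quoted in the issue, which is consistent with the trailing offset being left stale after the write to slot 0.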

I reproduced this after looking at internal code that ran into this error; I 
think once we have published documentation on how to use VectorSchemaRoot, it 
will be easier to avoid this, but it seems unfortunate that the Java API makes 
it easy to get things wrong.

> BaseVariableWidthVector#setSafe appears to make value offsets inconsistent
> --------------------------------------------------------------------------
>
>                 Key: ARROW-7254
>                 URL: https://issues.apache.org/jira/browse/ARROW-7254
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 0.15.1
>            Reporter: David Li
>            Priority: Minor
>
> The following program writes a file which PyArrow either segfaults (0.14.1) 
> or rejects with an error (0.15.1) {{pyarrow.lib.ArrowInvalid: Column 0: 
> Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4}} 
> on reading.
> Calling {{setRowCount}} again, or calling {{setSafe}} with a higher index 
> fixes it. While it seems from the new documentation that we should (must?) 
> call {{VectorSchemaRoot#setRowCount}} at the end, I wouldn't have expected to 
> get an invalid file by using {{setSafe}}, either. 
> Full traceback:
> {noformat}
> > python3 -c 'import pyarrow as pa; print(pa.ipc.open_stream(open("./test.bin", "rb")).read_pandas())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/Users/lidavidm/Flight/arrow-5137-auth/java/venv/lib/python3.7/site-packages/pyarrow/ipc.py", line 46, in read_pandas
>     table = self.read_all()
>   File "pyarrow/ipc.pxi", line 330, in pyarrow.lib._CRecordBatchReader.read_all
>   File "pyarrow/public-api.pxi", line 321, in pyarrow.lib.pyarrow_wrap_table
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4
> {noformat}
>  
> Full program:
> {code:java}
> import java.io.OutputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.Collections;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowStreamWriter;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> public class AsdfTest {
>   public static void main(String[] args) throws Exception {
>     Schema schema = new Schema(Collections.singletonList(Field.nullable("a", new ArrowType.Utf8())));
>     try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
>       root.setRowCount(2);
>       VarCharVector v = (VarCharVector) root.getVector("a");
>       v.setSafe(0, "asdf".getBytes(StandardCharsets.UTF_8));
>       try (OutputStream output = Files.newOutputStream(Paths.get("./test.bin"))) {
>         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, output);
>         writer.writeBatch();
>         writer.close();
>       }
>     }
>   }
> }
> {code}
> {{v.setNull(1)}} after {{v.setSafe(0, "asdf")}} does not fix it. Using 
> {{set}} instead of {{setSafe}} will fail in Java.


