[
https://issues.apache.org/jira/browse/ARROW-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982117#comment-16982117
]
Liya Fan commented on ARROW-7254:
---------------------------------
Further investigation shows that the write index is not set because the
offset buffer entries for the trailing values are wrong (as the title of this
issue indicates).
In addition, it seems the write index for the data buffer is deliberately
left unset for performance reasons. So we provide a simple fix that limits
the impact to IPC only: we bring the vector to a consistent state just before
unloading.
Please see if it looks good. Thank you in advance.
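
To illustrate the idea, here is a minimal sketch in plain Java that models the offset buffer as an int array. It is not the actual patch and does not use Arrow's classes; the class and method names (OffsetModel, firstInconsistentOffset, fillTrailingOffsets) are hypothetical. It assumes the offsets written by the reporter's program were [0, 4, 0], which is inferred from the "at: 2 ... 0!=4" error message rather than confirmed.

```java
import java.util.Arrays;

/** Simplified model of a variable-width offset buffer; NOT Arrow's implementation. */
public class OffsetModel {

  /**
   * Returns the first index i where offsets[i] is smaller than offsets[i - 1],
   * or -1 if the offsets never decrease.
   */
  static int firstInconsistentOffset(int[] offsets) {
    for (int i = 1; i < offsets.length; i++) {
      if (offsets[i] < offsets[i - 1]) {
        return i;
      }
    }
    return -1;
  }

  /**
   * Copies the end offset of the last written slot into every later entry,
   * so that each trailing (unset) slot becomes a zero-length value.
   *
   * @param lastSet index of the last slot that was written, or -1 if none
   */
  static void fillTrailingOffsets(int[] offsets, int lastSet) {
    int end = offsets[lastSet + 1];
    for (int i = lastSet + 2; i < offsets.length; i++) {
      offsets[i] = end;
    }
  }

  public static void main(String[] args) {
    // Two slots: slot 0 holds "asdf" (4 bytes), slot 1 was never written.
    int[] offsets = {0, 4, 0};
    System.out.println(firstInconsistentOffset(offsets)); // prints 2

    // Bring the offsets to a consistent state, as one would before unloading.
    fillTrailingOffsets(offsets, 0);
    System.out.println(Arrays.toString(offsets));         // prints [0, 4, 4]
    System.out.println(firstInconsistentOffset(offsets)); // prints -1
  }
}
```

In this model the repair is only needed once, right before the buffers are handed to the writer, which is why doing it at unload time keeps the cost out of the regular set path.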
> BaseVariableWidthVector#setSafe appears to make value offsets inconsistent
> --------------------------------------------------------------------------
>
> Key: ARROW-7254
> URL: https://issues.apache.org/jira/browse/ARROW-7254
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 0.15.1
> Reporter: David Li
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The following program writes a file which PyArrow either segfaults (0.14.1)
> or rejects with an error (0.15.1) {{pyarrow.lib.ArrowInvalid: Column 0:
> Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4}}
> on reading.
> Calling {{setRowCount}} again, or calling {{setSafe}} with a higher index
> fixes it. While it seems from the new documentation that we should (must?)
> call {{VectorSchemaRoot#setRowCount}} at the end, I wouldn't have expected to
> get an invalid file by using {{setSafe}}, either.
> Full traceback:
> {noformat}
> > python3 -c 'import pyarrow as pa; print(pa.ipc.open_stream(open("./test.bin", "rb")).read_pandas())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/Users/lidavidm/Flight/arrow-5137-auth/java/venv/lib/python3.7/site-packages/pyarrow/ipc.py", line 46, in read_pandas
>     table = self.read_all()
>   File "pyarrow/ipc.pxi", line 330, in pyarrow.lib._CRecordBatchReader.read_all
>   File "pyarrow/public-api.pxi", line 321, in pyarrow.lib.pyarrow_wrap_table
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4
> {noformat}
>
> Full program:
> {code:java}
> import java.io.OutputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.Collections;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowStreamWriter;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
>
> public class AsdfTest {
>   public static void main(String[] args) throws Exception {
>     Schema schema = new Schema(Collections.singletonList(Field.nullable("a", new ArrowType.Utf8())));
>     try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>          VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
>       root.setRowCount(2);
>       VarCharVector v = (VarCharVector) root.getVector("a");
>       v.setSafe(0, "asdf".getBytes(StandardCharsets.UTF_8));
>       try (OutputStream output = Files.newOutputStream(Paths.get("./test.bin"))) {
>         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, output);
>         writer.writeBatch();
>         writer.close();
>       }
>     }
>   }
> }
> {code}
> {{v.setNull(1)}} after {{v.setSafe(0, "asdf")}} does not fix it. Using
> {{set}} instead of {{setSafe}} will fail in Java.