[ 
https://issues.apache.org/jira/browse/ARROW-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983266#comment-16983266
 ] 

Micah Kornfield commented on ARROW-7254:
----------------------------------------

{quote}I am sorry I do not fully understand the meaning of "100% performance 
penalty". IMO, the exact penalty for sparse vectors should be: setting the 
offsets buffer from {{lastSetIndex}} to {{valueCount}}.
{quote}
I think that if one calls {{setValueCount}}, which calls {{fillHoles}}, and 
then sets only one element in a large list, {{fillHoles}} will be called again 
to fix this?  Like I said, I only looked at it quickly, so I trust your analysis.
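
To make the sequence under discussion concrete, here is a minimal sketch of an offsets buffer being filled, broken, and repaired. It is a toy model, not Arrow's actual {{BaseVariableWidthVector}}: the class {{OffsetsSketch}} and its fields are hypothetical, and the method names only mirror the ones mentioned in this thread. It assumes {{fillHoles}} copies the last valid end offset forward over unset slots.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy model of a variable-width vector's offsets buffer (hypothetical, not Arrow code). */
final class OffsetsSketch {
  private final int[] offsets;            // value i occupies data[offsets[i] .. offsets[i + 1])
  private final byte[] data = new byte[64];
  private int lastSet = -1;               // highest index whose end offset is known to be valid

  OffsetsSketch(int capacity) {
    offsets = new int[capacity + 1];
  }

  /** Copies the previous end offset forward so every slot in (lastSet, index) is empty. */
  void fillHoles(int index) {
    for (int i = lastSet + 1; i < index; i++) {
      offsets[i + 1] = offsets[i];
    }
    lastSet = index - 1;
  }

  /** Writes a value at index, filling any skipped slots first. */
  void set(int index, byte[] value) {
    fillHoles(index);
    int start = offsets[index];
    System.arraycopy(value, 0, data, start, value.length);
    offsets[index + 1] = start + value.length;
    lastSet = index;
  }

  /** Fills the offsets of all unset slots up to valueCount. */
  void setValueCount(int valueCount) {
    fillHoles(valueCount);
  }

  /** Returns true if the first valueCount + 1 offsets never decrease. */
  boolean offsetsAreMonotonic(int valueCount) {
    for (int i = 0; i < valueCount; i++) {
      if (offsets[i + 1] < offsets[i]) {
        return false;
      }
    }
    return true;
  }

  int[] offsets() {
    return offsets.clone();
  }

  public static void main(String[] args) {
    OffsetsSketch v = new OffsetsSketch(2);
    v.setValueCount(2);                                   // same order as the report: count first
    v.set(0, "asdf".getBytes(StandardCharsets.UTF_8));    // then write slot 0
    System.out.println(Arrays.toString(v.offsets()));     // [0, 4, 0]: slot 1 ends before it starts
    v.setValueCount(2);                                   // second pass repairs offsets[2]
    System.out.println(Arrays.toString(v.offsets()));     // [0, 4, 4]
  }
}
```

In this model the second {{setValueCount}} call only touches the offsets between the last set index and the value count, which is the cost described in the quoted text.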

 

Columnar.rst is the wrong place for this documentation because it is 
implementation agnostic.  Ji Liu recently added prose documentation for Java; 
we should add it there (the documentation should get published with the next 
release).

> BaseVariableWidthVector#setSafe appears to make value offsets inconsistent
> --------------------------------------------------------------------------
>
>                 Key: ARROW-7254
>                 URL: https://issues.apache.org/jira/browse/ARROW-7254
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 0.15.1
>            Reporter: David Li
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The following program writes a file which PyArrow either segfaults (0.14.1) 
> or rejects with an error (0.15.1) {{pyarrow.lib.ArrowInvalid: Column 0: 
> Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4}} 
> on reading.
> Calling {{setRowCount}} again, or calling {{setSafe}} with a higher index, 
> fixes it. While it seems from the new documentation that we should (must?) 
> call {{VectorSchemaRoot#setRowCount}} at the end, I wouldn't have expected to 
> get an invalid file by using {{setSafe}}, either. 
> Full traceback:
> {noformat}
> > python3 -c 'import pyarrow as pa; print(pa.ipc.open_stream(open("./test.bin", "rb")).read_pandas())'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/Users/lidavidm/Flight/arrow-5137-auth/java/venv/lib/python3.7/site-packages/pyarrow/ipc.py", line 46, in read_pandas
>     table = self.read_all()
>   File "pyarrow/ipc.pxi", line 330, in pyarrow.lib._CRecordBatchReader.read_all
>   File "pyarrow/public-api.pxi", line 321, in pyarrow.lib.pyarrow_wrap_table
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column 0: Offset invariant failure at: 2 inconsistent value_offsets for null slot0!=4
> {noformat}
>  
> Full program:
> {code:java}
> import java.io.OutputStream;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.Collections;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.ipc.ArrowStreamWriter;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> public class AsdfTest {
>   public static void main(String[] args) throws Exception {
>     Schema schema = new Schema(
>         Collections.singletonList(Field.nullable("a", new ArrowType.Utf8())));
>     try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
>         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
>       root.setRowCount(2);
>       VarCharVector v = (VarCharVector) root.getVector("a");
>       v.setSafe(0, "asdf".getBytes(StandardCharsets.UTF_8));
>       try (OutputStream output = Files.newOutputStream(Paths.get("./test.bin"))) {
>         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, output);
>         writer.writeBatch();
>         writer.close();
>       }
>     }
>   }
> }
> {code}
> {{v.setNull(1)}} after {{v.setSafe(0, "asdf")}} does not fix it. Using 
> {{set}} instead of {{setSafe}} will fail in Java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
