Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/20114
ping @ueshin @HyukjinKwon
Unfortunately, there was a bug on the Java side of the Arrow 0.8.0 release (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be OK putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.
The issue was that the Arrow spec states that if the validity buffer is empty, then all of the values are non-null. In Arrow 0.8.0 the C++/Python side started sending buffers this way, but the Arrow Java `ListVector` was not handling it properly, interpreting it instead as meaning that none of the values were valid.
The workaround I added here checks whether the `ListVector` has a value count > 0 together with an empty validity buffer. That combination means all of the values are non-null, so it allocates a new validity buffer with all bits set (see the sketch below).
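For reference, a minimal sketch of just the detection half of that check, assuming the Arrow 0.8.0 Java APIs `getValueCount()` and `getValidityBuffer()`; the class and method names here are illustrative, and the buffer re-allocation step is omitted for brevity:

```java
import org.apache.arrow.vector.complex.ListVector;

final class Arrow1948Check {
  // Detects the ARROW-1948 case: a non-zero value count paired with an
  // empty validity buffer, which per the Arrow spec means every value
  // is non-null. The caller can then allocate a fresh validity buffer
  // with all bits set (omitted here).
  static boolean allValuesNonNull(ListVector vector) {
    return vector.getValueCount() > 0
        && vector.getValidityBuffer().capacity() == 0;
  }
}
```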
For the non-UDF Arrow paths (`toPandas` and `createDataFrame`) this only needs to be done once, but for UDFs each batch read loads new buffers into the Arrow `VectorSchemaRoot`, so it needs to be checked after each read. The simplest place to put the workaround to cover both cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden, as sketched below. Let me know what you guys think, thanks!
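To illustrate the shape of the change, here's a rough sketch of the accessor hierarchy with an overridable `isNullAt`, loosely modeled on the `ArrowColumnVector` internals; treat the exact names as illustrative rather than the final code. Note this variant short-circuits the null check per row instead of allocating a replacement validity buffer, but it exercises the same condition:

```java
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.complex.ListVector;

abstract class ArrowVectorAccessor {
  private final ValueVector vector;

  ArrowVectorAccessor(ValueVector vector) {
    this.vector = vector;
  }

  // No longer final, so the array accessor can apply the ARROW-1948
  // workaround on every null check.
  boolean isNullAt(int rowId) {
    return vector.isNull(rowId);
  }
}

final class ArrayAccessor extends ArrowVectorAccessor {
  private final ListVector accessor;

  ArrayAccessor(ListVector vector) {
    super(vector);
    this.accessor = vector;
  }

  @Override
  boolean isNullAt(int rowId) {
    // ARROW-1948: an empty validity buffer with a non-zero value count
    // means all values are non-null, so never report null in that case.
    if (accessor.getValueCount() > 0
        && accessor.getValidityBuffer().capacity() == 0) {
      return false;
    }
    return super.isNullAt(rowId);
  }
}
```

Putting the check inside `isNullAt` means it is naturally re-evaluated against whatever buffers are currently loaded, which also covers the UDF path where each batch read replaces the buffers in the root.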