Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/20114
ping @ueshin @HyukjinKwon
Unfortunately, there was a bug on the Java side of the Arrow 0.8.0 release (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be OK putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.
The issue was that the Arrow spec states that if the validity buffer is empty, then all of the values are non-null. In Arrow 0.8.0 the C++/Python side started sending buffers this way, but the Arrow Java `ListVector` was not handling it properly, interpreting it instead as meaning that none of the values were valid.
The workaround I added here checks whether the `ListVector` has a value count > 0 together with an empty validity buffer. That combination means all of the values are non-null, so it allocates a new validity buffer with all bits set (see the sketch below).
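For reference, a minimal sketch of just the detection half of that check, assuming the Arrow 0.8.0 Java APIs `getValueCount()` and `getValidityBuffer()`; the class and method names here are illustrative, and the buffer re-allocation step is omitted for brevity:

```java
import org.apache.arrow.vector.complex.ListVector;

final class Arrow1948Check {
  // Detects the ARROW-1948 case: a non-zero value count paired with an
  // empty validity buffer, which per the Arrow spec means every value
  // is non-null. The caller can then allocate a fresh validity buffer
  // with all bits set (omitted here).
  static boolean allValuesNonNull(ListVector vector) {
    return vector.getValueCount() > 0
        && vector.getValidityBuffer().capacity() == 0;
  }
}
```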
For the non-UDF Arrow paths (`toPandas` and `createDataFrame`) this only needs to be done once, but for UDFs each batch read loads new buffers into the Arrow `VectorSchemaRoot`, so it needs to be checked after each read. The simplest place to put the workaround to cover both cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden, as sketched below. Let me know what you guys think, thanks!
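To illustrate the shape of the change, here's a rough sketch of the accessor hierarchy with an overridable `isNullAt`, loosely modeled on the `ArrowColumnVector` internals; treat the exact names as illustrative rather than the final code. Note this variant short-circuits the null check per row instead of allocating a replacement validity buffer, but it exercises the same condition:

```java
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.complex.ListVector;

abstract class ArrowVectorAccessor {
  private final ValueVector vector;

  ArrowVectorAccessor(ValueVector vector) {
    this.vector = vector;
  }

  // No longer final, so the array accessor can apply the ARROW-1948
  // workaround on every null check.
  boolean isNullAt(int rowId) {
    return vector.isNull(rowId);
  }
}

final class ArrayAccessor extends ArrowVectorAccessor {
  private final ListVector accessor;

  ArrayAccessor(ListVector vector) {
    super(vector);
    this.accessor = vector;
  }

  @Override
  boolean isNullAt(int rowId) {
    // ARROW-1948: an empty validity buffer with a non-zero value count
    // means all values are non-null, so never report null in that case.
    if (accessor.getValueCount() > 0
        && accessor.getValidityBuffer().capacity() == 0) {
      return false;
    }
    return super.isNullAt(rowId);
  }
}
```

Putting the check inside `isNullAt` means it is naturally re-evaluated against whatever buffers are currently loaded, which also covers the UDF path where each batch read replaces the buffers in the root.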