Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20114

ping @ueshin @HyukjinKwon

Unfortunately, there was a bug in the Arrow 0.8.0 release on the Java side (https://issues.apache.org/jira/browse/ARROW-1948) that caused a problem here. I was able to find a workaround, but it required a change to the `ArrowVectorAccessor` class. I'm not sure if this is something you would be OK with putting in, or if you would prefer to wait until the next minor release to add the ArrayType support.

The issue is that the Arrow spec states that if the validity buffer is empty, then all of the values are non-null. In Arrow 0.8.0, the C++/Python side started sending buffers this way, but the Arrow `ListVector` did not handle it properly and instead treated the batch as having no valid values. The workaround I added here checks whether the `ListVector` has a value count > 0 together with an empty validity buffer; since that combination means all of the values are non-null, it allocates a new validity buffer with all bits set.

For Arrow used without UDFs (`toPandas` and `createDataFrame`) this only needs to be done once, but for UDFs each batch read loads new buffers into the Arrow `VectorSchemaRoot`, so it needs to be checked after every read. The simplest place to put the workaround that covers both cases was to allow `ArrowVectorAccessor.isNullAt(int rowId)` to be overridden.

Let me know what you guys think, thanks!
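For reviewers, a minimal sketch of the null-check rule the workaround relies on. This is not the actual Spark/Arrow code from the PR; the class and method names here are illustrative, and it only models the spec behavior: an empty validity buffer with a positive value count means every value is non-null, otherwise a cleared bit in the bitmap marks a null.

```java
// Hypothetical sketch of the validity-bitmap rule from the Arrow spec.
// Not the real ArrowVectorAccessor; names and signatures are illustrative.
public class ValiditySketch {

    // Bit i of the validity buffer set => value at row i is non-null.
    static boolean isNullAt(byte[] validityBuffer, int valueCount, int rowId) {
        // Per the Arrow spec: an empty validity buffer with a positive
        // value count means all values are non-null. This is the case
        // the 0.8.0 Java ListVector mishandled (ARROW-1948).
        if (validityBuffer.length == 0 && valueCount > 0) {
            return false;
        }
        int byteIndex = rowId / 8;
        int bitIndex = rowId % 8;
        return ((validityBuffer[byteIndex] >> bitIndex) & 1) == 0;
    }

    public static void main(String[] args) {
        // Empty buffer, positive count: nothing is null.
        System.out.println(isNullAt(new byte[0], 4, 2));            // false
        // Bitmap 0b0101: rows 0 and 2 valid, row 1 null.
        System.out.println(isNullAt(new byte[]{0b0101}, 3, 1));     // true
    }
}
```

The actual fix described above is the inverse of the last branch: when the first condition holds, allocate a fresh validity buffer with all bits set so downstream readers see every value as valid.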