Oh, my apologies. There is definitely substantial benefit to using vectorized encoding, both in terms of performance (though I don't have numbers on hand) and in terms of interoperability.
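As background on the encoding under discussion: per the Parquet spec, a DELTA_LENGTH_BYTE_ARRAY page stores all value lengths first, followed by the concatenated byte-array data. Keeping the data in one contiguous buffer (much like Arrow's variable-length binary layout) is part of what makes the encoding friendly to vectorized readers. The sketch below is a simplification: the real format encodes the lengths with DELTA_BINARY_PACKED and takes the value count from the page header, whereas here plain int32 lengths and a count prefix are used purely for illustration.

```python
import struct

def encode_dlba(values: list[bytes]) -> bytes:
    # DELTA_LENGTH_BYTE_ARRAY layout: all lengths up front, then the
    # concatenated byte-array data. NOTE: the real format encodes lengths
    # with DELTA_BINARY_PACKED and has no count prefix; plain little-endian
    # int32s are used here only to keep the sketch short.
    lengths = b"".join(struct.pack("<i", len(v)) for v in values)
    data = b"".join(values)
    return struct.pack("<i", len(values)) + lengths + data

def decode_dlba(buf: bytes) -> list[bytes]:
    # Read the count, then the length block, then slice the data buffer.
    (n,) = struct.unpack_from("<i", buf, 0)
    lengths = [struct.unpack_from("<i", buf, 4 + 4 * i)[0] for i in range(n)]
    out, pos = [], 4 + 4 * n
    for length in lengths:
        out.append(buf[pos:pos + length])
        pos += length
    return out
```

Because the decoder learns every offset before touching the data, it can slice (or bulk-copy) the value buffer in one pass instead of interleaving length reads with data reads, which is the property vectorized readers exploit.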
I admittedly have not tried it directly with PyArrow 4. Work on this would be great, though. It would be ideal if the entirety of the Parquet writer v2 file format could be read, including with vectorization enabled. Here is the Spark code in the latest master branch: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet

Trino supports it, so I would also check there. I found this open issue for an "optimized" Parquet writer; the next ticket (per this doc) is related to Spark interop, and the one after that is related to the v2 file format and "other engines": https://github.com/trinodb/trino/issues/6382

There has been some work recently to support more vectorization options and update the error messages. If there's not an open issue on the Iceberg GitHub, could you open one? We could potentially link to the open Trino issue as well, as this affects Iceberg users on both platforms :)

Let me know if there is any further way I can be of help :)

Kyle

Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
[email protected]

This email and any attachments may be privileged and may contain confidential information intended only for the recipient(s) named above. Any other distribution, forwarding, copying or disclosure of this message is strictly prohibited. If you have received this email in error, please notify me immediately by telephone or return email, and delete this message from your system.

> On Jul 20, 2021, at 12:04 AM, Kyle Bendickson <[email protected]> wrote:
>
> Hi Jorge,
>
> DELTA_LENGTH_BYTE_ARRAY encoding is a Parquet writer v2 feature.
>
> I do not believe that the Spark ParquetReaders implement page v2 at the moment (this might be what you mean when you say you're working on it).
>
> For files generated with DELTA_LENGTH_BYTE_ARRAY encoding (e.g. from Trino), I've been able to get around it by disabling vectorized Parquet reads.
> This has solved the immediate problem of making the files readable for me, though of course it comes with the disadvantage of not getting vectorized reads.
>
> Try changing this setting to false: spark.sql.iceberg.vectorization.enabled
>
> Here's an issue that should get you some more insight:
> https://github.com/apache/iceberg/issues/2692
>
> Let me know if that answers your question!
>
> Kyle Bendickson
>
>> On Jul 19, 2021, at 9:28 PM, Jorge Cardoso Leitão <[email protected]> wrote:
>>
>> Hi,
>>
>> I am trying to add support for DELTA_LENGTH_BYTE_ARRAY in a package, but I am struggling to find readers for it, despite the fact that the spec states "This encoding is always preferred over PLAIN for byte array columns."
>>
>> * spark 3.X: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
>> * pyarrow 4: OSError: Not yet implemented: Unsupported encoding.
>>
>> Is there a minimal set of preferred encodings, or do people just ignore the encodings and use either PLAIN or dictionary? Or are the encodings just not widely supported because they do not bring sufficient benefits?
>>
>> Could someone offer some context on the situation?
>>
>> Best,
>> Jorge
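For readers landing on this thread: the workaround discussed above can be applied from a Spark session roughly as follows. This is a sketch, not a definitive recipe; the session property name is quoted verbatim from the thread, the per-read option name is how Iceberg's Spark read options expose the same switch in some versions, and the table name `db.events` is purely hypothetical. Check the documentation for your Iceberg release.

```python
# Sketch of the workaround discussed above: turn off Iceberg's vectorized
# Parquet reads so files with writer-v2 encodings (e.g.
# DELTA_LENGTH_BYTE_ARRAY) fall back to the non-vectorized reader.
# Property/option names assume the versions discussed in the thread;
# verify against your Iceberg release's documentation.

# Session-wide (property name quoted from the thread):
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")

# Per-read, via an Iceberg Spark read option (name may vary by version;
# "db.events" is a hypothetical table):
df = spark.read.option("vectorization-enabled", "false").table("db.events")
```

The trade-off, as noted in the thread, is losing vectorized read performance in exchange for being able to read the v2-encoded files at all.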
