Hi Jorge,

This has been discussed previously; there is an open PR [1] that tries to pin down a minimal set of supported features, though it has lost some steam.
-Micah

[1] https://github.com/apache/parquet-format/pull/164

On Tue, Jul 20, 2021 at 1:15 AM Kyle Bendickson <[email protected]> wrote:

> Oh, my apologies.
>
> There is definitely substantial benefit to using the vectorized encodings, both in terms of performance (though I don't have numbers on hand) and in terms of interoperability.
>
> I admittedly have not tried it directly with PyArrow 4.
>
> Work on this would be great, though. It would be ideal if the entirety of the Parquet writer v2 file format could be read, including with vectorization enabled.
>
> Here is the Spark code in the latest master branch:
> https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet
>
> Trino supports it, so I would also check there. I found this open issue for an "optimized" Parquet writer; per that doc, the next ticket is related to Spark interop, and the one after that to the v2 file format and "other engines": https://github.com/trinodb/trino/issues/6382
>
> There has been some work recently to support more vectorization options and to update the error messages.
>
> If there's not an open issue on the Iceberg GitHub, can you open one? We could potentially link to the open Trino issue as well, since this affects Iceberg users on both platforms :)
>
> Let me know if there is any further way I can be of help :)
> Kyle
>
> Kyle Bendickson
> Software Engineer
> Apple
>
> > On Jul 20, 2021, at 12:04 AM, Kyle Bendickson <[email protected]> wrote:
> >
> > Hi Jorge,
> >
> > DELTA_LENGTH_BYTE_ARRAY encoding is a Parquet writer v2 feature. I do not believe that the Spark ParquetReaders implement page v2 at the moment (this might be what you mean when you say you're working on it).
> >
> > For files generated with DELTA_BYTE_ARRAY encoding (e.g. from Trino), I've been able to get around it by disabling vectorized Parquet reads. This has solved the immediate problem of making the files readable for me, though of course it comes with the disadvantage of not getting vectorized reads.
> >
> > Try changing this setting to false: spark.sql.iceberg.vectorization.enabled
> >
> > Here's an issue that should give you some more insight: https://github.com/apache/iceberg/issues/2692
> >
> > Let me know if that answers your question!
> >
> > Kyle Bendickson
> > Software Engineer
> > Apple
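(For concreteness, a minimal PySpark sketch of the workaround Kyle describes above; this assumes an Iceberg catalog is already configured, and the table name db.events is hypothetical.)

    from pyspark.sql import SparkSession

    # Disable Iceberg's vectorized Parquet reader so Spark falls back to the
    # non-vectorized read path, which per the thread can decode these pages.
    spark = (
        SparkSession.builder
        .appName("read-parquet-v2")
        .config("spark.sql.iceberg.vectorization.enabled", "false")
        .getOrCreate()
    )

    # db.events is a hypothetical Iceberg table name, purely for illustration.
    spark.table("db.events").show()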
> >> On Jul 19, 2021, at 9:28 PM, Jorge Cardoso Leitão <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I am trying to add support for DELTA_LENGTH_BYTE_ARRAY in a package, but I am struggling to find readers for it, despite the spec stating "This encoding is always preferred over PLAIN for byte array columns.":
> >>
> >> * Spark 3.x: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
> >> * pyarrow 4: OSError: Not yet implemented: Unsupported encoding.
> >>
> >> Is there a minimal set of preferred encodings, or do people just ignore the encodings and use either PLAIN or dictionary? Or are the encodings not widely supported because they do not bring sufficient benefits?
> >>
> >> Could someone offer some context on the situation?
> >>
> >> Best,
> >> Jorge
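(For context on the encoding Jorge is implementing: per the Parquet format spec, a DELTA_LENGTH_BYTE_ARRAY page stores the lengths of all values as a single DELTA_BINARY_PACKED block, followed by the concatenated value bytes. A minimal decoding sketch in Python; decode_delta_binary_packed is a hypothetical stand-in for a real decoder of that sub-encoding.)

    def decode_delta_length_byte_array(page: bytes, num_values: int) -> list[bytes]:
        # First, one DELTA_BINARY_PACKED run holding every value's length.
        # The hypothetical helper returns the decoded lengths and the number
        # of bytes it consumed from the page.
        lengths, offset = decode_delta_binary_packed(page, num_values)
        # Then the value bytes, back to back, sliced out by those lengths.
        values = []
        for n in lengths:
            values.append(page[offset:offset + n])
            offset += n
        return values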

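(And a sketch of the pyarrow side of the report. The column_encoding override shown here only exists in newer pyarrow releases, around 8.0, and whether DELTA_LENGTH_BYTE_ARRAY is accepted for writing depends on the version, so treat this as illustrative rather than something pyarrow 4 could run.)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"s": pa.array(["foo", "bar", "baz"], type=pa.string())})

    # Dictionary encoding must be disabled for a per-column encoding
    # override to take effect.
    pq.write_table(
        table,
        "dlba.parquet",
        use_dictionary=False,
        column_encoding={"s": "DELTA_LENGTH_BYTE_ARRAY"},
    )

    # With pyarrow 4, reading a file containing this encoding fails with
    #   OSError: Not yet implemented: Unsupported encoding.
    # which is the error Jorge quotes above.
    print(pq.read_table("dlba.parquet"))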