Hi Jorge,

This has been discussed previously; there is an open PR [1] that tries to pin down a minimal set of supported features, though it has lost some steam.
-Micah

[1] https://github.com/apache/parquet-format/pull/164

On Tue, Jul 20, 2021 at 1:15 AM Kyle Bendickson <[email protected]> wrote:

> Oh, my apologies.
>
> There is definitely substantial benefit to using the vectorized encodings, both in terms of performance (though I don't have numbers on hand) and in terms of interoperability.
>
> I admittedly have not tried it directly with PyArrow 4.
>
> Work on this would be great, though. It would be ideal if the entirety of the Parquet writer v2 file format could be read, including with vectorization enabled.
>
> Here is the Spark code in the latest master branch:
> https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet
>
> Trino supports it, so I would also check there. I found this open issue for an "optimized" Parquet writer; per that doc, the next ticket is related to Spark interop, and the one after that to the v2 file format and "other engines": https://github.com/trinodb/trino/issues/6382
>
> There has been some work recently to support more vectorization options and to update the error messages.
>
> If there's not an open issue on the Iceberg GitHub, can you open one? We could potentially link to the open Trino issue as well, since this affects Iceberg users on both platforms :)
>
> Let me know if there is any further way I can be of help :)
> Kyle
>
> Kyle Bendickson
> Software Engineer
> Apple
>
> > On Jul 20, 2021, at 12:04 AM, Kyle Bendickson <[email protected]> wrote:
> >
> > Hi Jorge,
> >
> > DELTA_LENGTH_BYTE_ARRAY encoding is a Parquet writer v2 feature. I do not believe that the Spark ParquetReaders implement page v2 at the moment (this might be what you mean when you say you're working on it).
> >
> > For files generated with DELTA_BYTE_ARRAY encoding (e.g. from Trino), I've been able to get around it by disabling vectorized Parquet reads. This has solved the immediate problem of making the files readable for me, though of course it comes with the disadvantage of not getting vectorized reads.
> >
> > Try changing this setting to false: spark.sql.iceberg.vectorization.enabled
> >
> > Here's an issue that should give you some more insight: https://github.com/apache/iceberg/issues/2692
> >
> > Let me know if that answers your question!
> >
> > Kyle Bendickson
> > Software Engineer
> > Apple
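(For concreteness, a minimal PySpark sketch of the workaround Kyle describes above; this assumes an Iceberg catalog is already configured, and the table name db.events is hypothetical.)

    from pyspark.sql import SparkSession

    # Disable Iceberg's vectorized Parquet reader so Spark falls back to the
    # non-vectorized read path, which per the thread can decode these pages.
    spark = (
        SparkSession.builder
        .appName("read-parquet-v2")
        .config("spark.sql.iceberg.vectorization.enabled", "false")
        .getOrCreate()
    )

    # db.events is a hypothetical Iceberg table name, purely for illustration.
    spark.table("db.events").show()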
> >> On Jul 19, 2021, at 9:28 PM, Jorge Cardoso Leitão <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I am trying to add support for DELTA_LENGTH_BYTE_ARRAY in a package, but I am struggling to find readers for it, despite the spec stating "This encoding is always preferred over PLAIN for byte array columns.":
> >>
> >> * Spark 3.x: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
> >> * pyarrow 4: OSError: Not yet implemented: Unsupported encoding.
> >>
> >> Is there a minimal set of preferred encodings, or do people just ignore the encodings and use either PLAIN or dictionary? Or are the encodings not widely supported because they do not bring sufficient benefits?
> >>
> >> Could someone offer some context on the situation?
> >>
> >> Best,
> >> Jorge
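(For context on the encoding Jorge is implementing: per the Parquet format spec, a DELTA_LENGTH_BYTE_ARRAY page stores the lengths of all values as a single DELTA_BINARY_PACKED block, followed by the concatenated value bytes. A minimal decoding sketch in Python; decode_delta_binary_packed is a hypothetical stand-in for a real decoder of that sub-encoding.)

    def decode_delta_length_byte_array(page: bytes, num_values: int) -> list[bytes]:
        # First, one DELTA_BINARY_PACKED run holding every value's length.
        # The hypothetical helper returns the decoded lengths and the number
        # of bytes it consumed from the page.
        lengths, offset = decode_delta_binary_packed(page, num_values)
        # Then the value bytes, back to back, sliced out by those lengths.
        values = []
        for n in lengths:
            values.append(page[offset:offset + n])
            offset += n
        return values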

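(And a sketch of the pyarrow side of the report. The column_encoding override shown here only exists in newer pyarrow releases, around 8.0, and whether DELTA_LENGTH_BYTE_ARRAY is accepted for writing depends on the version, so treat this as illustrative rather than something pyarrow 4 could run.)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"s": pa.array(["foo", "bar", "baz"], type=pa.string())})

    # Dictionary encoding must be disabled for a per-column encoding
    # override to take effect.
    pq.write_table(
        table,
        "dlba.parquet",
        use_dictionary=False,
        column_encoding={"s": "DELTA_LENGTH_BYTE_ARRAY"},
    )

    # With pyarrow 4, reading a file containing this encoding fails with
    #   OSError: Not yet implemented: Unsupported encoding.
    # which is the error Jorge quotes above.
    print(pq.read_table("dlba.parquet"))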