Oh, my apologies. There is definitely substantial benefit to using vectorized encoding, both in terms of performance (though I don't have numbers on hand) and in terms of interoperability.
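As background on the encoding under discussion: per the Parquet spec, a DELTA_LENGTH_BYTE_ARRAY page stores all value lengths first, followed by the concatenated byte-array data. Keeping the data in one contiguous buffer (much like Arrow's variable-length binary layout) is part of what makes the encoding friendly to vectorized readers. The sketch below is a simplification: the real format encodes the lengths with DELTA_BINARY_PACKED and takes the value count from the page header, whereas here plain int32 lengths and a count prefix are used purely for illustration.

```python
import struct

def encode_dlba(values: list[bytes]) -> bytes:
    # DELTA_LENGTH_BYTE_ARRAY layout: all lengths up front, then the
    # concatenated byte-array data. NOTE: the real format encodes lengths
    # with DELTA_BINARY_PACKED and has no count prefix; plain little-endian
    # int32s are used here only to keep the sketch short.
    lengths = b"".join(struct.pack("<i", len(v)) for v in values)
    data = b"".join(values)
    return struct.pack("<i", len(values)) + lengths + data

def decode_dlba(buf: bytes) -> list[bytes]:
    # Read the count, then the length block, then slice the data buffer.
    (n,) = struct.unpack_from("<i", buf, 0)
    lengths = [struct.unpack_from("<i", buf, 4 + 4 * i)[0] for i in range(n)]
    out, pos = [], 4 + 4 * n
    for length in lengths:
        out.append(buf[pos:pos + length])
        pos += length
    return out
```

Because the decoder learns every offset before touching the data, it can slice (or bulk-copy) the value buffer in one pass instead of interleaving length reads with data reads, which is the property vectorized readers exploit.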
I admittedly have not tried it directly with PyArrow 4. Work on this would be great, though. It would be ideal if the entirety of the Parquet writer v2 file format could be read, including with vectorization enabled. Here is the Spark code in the latest master branch: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet

Trino supports it, so I would also check there. I found this open issue for an "optimized" Parquet writer; the next ticket (per this doc) is related to Spark interop, and the one after that is related to the v2 file format and "other engines": https://github.com/trinodb/trino/issues/6382

There has been some work recently to support more vectorization options and update the error messages. If there's not an open issue on the Iceberg GitHub, could you open one? We could potentially link to the open Trino issue as well, as this affects Iceberg users on both platforms :)

Let me know if there is any further way I can be of help :)

Kyle

Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
[email protected]

This email and any attachments may be privileged and may contain confidential information intended only for the recipient(s) named above. Any other distribution, forwarding, copying or disclosure of this message is strictly prohibited. If you have received this email in error, please notify me immediately by telephone or return email, and delete this message from your system.

> On Jul 20, 2021, at 12:04 AM, Kyle Bendickson <[email protected]> wrote:
>
> Hi Jorge,
>
> DELTA_LENGTH_BYTE_ARRAY encoding is a Parquet writer v2 feature.
>
> I do not believe that the Spark ParquetReaders implement page v2 at the moment (this might be what you mean when you say you're working on it).
>
> For files generated with DELTA_LENGTH_BYTE_ARRAY encoding (e.g. from Trino), I've been able to get around it by disabling vectorized Parquet reads.
> This has solved the immediate problem of making the files readable for me, though of course it comes with the disadvantage of not getting vectorized reads.
>
> Try changing this setting to false: spark.sql.iceberg.vectorization.enabled
>
> Here's an issue that should get you some more insight:
> https://github.com/apache/iceberg/issues/2692
>
> Let me know if that answers your question!
>
> Kyle Bendickson
>
>> On Jul 19, 2021, at 9:28 PM, Jorge Cardoso Leitão <[email protected]> wrote:
>>
>> Hi,
>>
>> I am trying to add support for DELTA_LENGTH_BYTE_ARRAY in a package, but I am struggling to find readers for it, despite the fact that the spec states "This encoding is always preferred over PLAIN for byte array columns."
>>
>> * spark 3.X: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY
>> * pyarrow 4: OSError: Not yet implemented: Unsupported encoding.
>>
>> Is there a minimal set of preferred encodings, or do people just ignore the encodings and use either PLAIN or dictionary? Or are the encodings just not widely supported because they do not bring sufficient benefits?
>>
>> Could someone offer some context on the situation?
>>
>> Best,
>> Jorge
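For readers landing on this thread: the workaround discussed above can be applied from a Spark session roughly as follows. This is a sketch, not a definitive recipe; the session property name is quoted verbatim from the thread, the per-read option name is how Iceberg's Spark read options expose the same switch in some versions, and the table name `db.events` is purely hypothetical. Check the documentation for your Iceberg release.

```python
# Sketch of the workaround discussed above: turn off Iceberg's vectorized
# Parquet reads so files with writer-v2 encodings (e.g.
# DELTA_LENGTH_BYTE_ARRAY) fall back to the non-vectorized reader.
# Property/option names assume the versions discussed in the thread;
# verify against your Iceberg release's documentation.

# Session-wide (property name quoted from the thread):
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")

# Per-read, via an Iceberg Spark read option (name may vary by version;
# "db.events" is a hypothetical table):
df = spark.read.option("vectorization-enabled", "false").table("db.events")
```

The trade-off, as noted in the thread, is losing vectorized read performance in exchange for being able to read the v2-encoded files at all.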
