Hi Andrew! Haven't seen you in a while!

Back on topic,

The deduplication procedure we are using is file-type independent: it
simply chunks the file into variable-sized chunks averaging ~64 KB.
It is not inherently aware of the Parquet file structure, and it strives
to store things byte-for-byte identical so we don't have to deal with
parsers, malformed files, new formats, etc.
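
To make that concrete, here is a minimal sketch of content-defined
chunking (not our production code; the gear table, mask, and min/max
bounds below are illustrative):

    # Content-defined chunking: boundaries depend only on nearby bytes,
    # so an edit early in a file only disturbs the chunks around it.
    import random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte values
    MASK = (1 << 16) - 1                  # ~64 KiB average chunk size
    MIN_CHUNK, MAX_CHUNK = 16 * 1024, 256 * 1024

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield (start, i + 1)
                start, h = i + 1, 0
        if start < len(data):
            yield (start, len(data))

Each chunk is then addressed by a hash of its contents, so any two
files that share long byte ranges end up sharing chunks.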

Also, we operate (like git) on a snapshot basis. We are not storing
information about how a file changed, as we do not have that information,
nor do we want to try to derive it. If we knew the operations that
changed the file, Iceberg would be the ideal solution, I imagine. As
such, we need to identify common byte sequences which already exist
"somewhere" in our system and dedupe accordingly. In a sense, what we
are trying to do is orthogonal to Iceberg (deltas vs. snapshots).

However, the "file_offset" fields in RowGroup and ColumnChunk are not 
position independent in the file and so result in significant fragmentation,
and for files with small row groups, poor deduplication. 
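
For anyone who wants to see this, a quick pyarrow session (the path is
hypothetical) dumps the absolute positions stored per column chunk;
writing the same row group at a different position in a file changes
every one of these values, and hence the footer bytes:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a small file with many row groups, then print the absolute
    # offsets that each ColumnChunk records in the footer.
    pq.write_table(pa.table({"x": list(range(1000))}),
                   "/tmp/example.parquet", row_group_size=100)
    md = pq.ParquetFile("/tmp/example.parquet").metadata
    for rg in range(md.num_row_groups):
        col = md.row_group(rg).column(0)
        print(rg, col.file_offset, col.data_page_offset)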

We could of course implement a specialized format-aware pre-process to
rewrite those offsets for storage purposes, but in my mind that is
remarkably difficult to make resilient when the goal is byte-for-byte
reconstruction.
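
If we did attempt it, it would have to be wrapped in something like the
following harness (a sketch; rebase_offsets/restore_offsets are
hypothetical transforms, not code we have):

    # Only store the transformed bytes if the inverse transform provably
    # reproduces the original; otherwise fall back to the raw bytes.
    def store_parquet(raw: bytes) -> bytes:
        try:
            transformed = rebase_offsets(raw)        # absolute -> relative
            if restore_offsets(transformed) == raw:  # byte-for-byte check
                return transformed
        except Exception:
            pass  # malformed or unrecognized layout: never guess
        return raw

And even then, every new writer quirk or format revision becomes a
potential correctness bug on the read-back path.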

While we may just have to accept this for the current Parquet format
(we have some tricks to deal with fragmentation), if there are plans to
update the format, addressing the issue at the format layer (switching
all absolute offsets to relative ones) would be great and may have
further benefits here as well.
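
As a toy illustration of why relative offsets help (field names here
are made up, not a proposal for the thrift schema):

    # The same row group written at two different file positions. With
    # absolute offsets the metadata differs; rebased to the row group
    # start, it becomes identical bytes and dedupes for free.
    rg_in_file_a = {"chunk_offsets": [0, 4096, 8192]}       # starts at 0
    rg_in_file_b = {"chunk_offsets": [9000, 13096, 17192]}  # starts at 9000
    base = 9000
    relative = [o - base for o in rg_in_file_b["chunk_offsets"]]
    assert relative == rg_in_file_a["chunk_offsets"]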

Thanks,
Yucheng

> On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> 
> I am sorry for the likely dumb question, but I think I am missing something.
> 
> The blog post says " This means that any modification is likely to rewrite
> all the Column headers."
> 
> But my understanding of the parquet format is that the ColumnChunks[1] are
> stored inline with the RowGroups which are stored in the footer.
> 
> Thus I would expect that a parquet deduplication process could copy the
> data for each row group memcpy style, and write a new footer with updated
> offsets. This doesn't require rewriting the entire file, simply adjusting
> offsets and writing a new footer.
> 
> Andrew
> 
> 
> [1]
> https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
> 
> 
> On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
> 
>> Hi,
>> 
>> I am the author of the blog here!
>> Happy to answer any questions.
>> 
>> There are a couple of parts: one regarding relative pointers, and a
>> second regarding the row group chunking system (which for performance
>> purposes could benefit from being implemented in the C/C++ layer). I am
>> happy to help where I can with the latter, as that can be done with the
>> current Parquet version too.
>> 
>> Thanks,
>> Yucheng
>> 
>> On 2024/10/09 15:46:01 Julien Le Dem wrote:
>>> I recommended to them that they join the dev list. I think that's the
>>> easiest to discuss.
>>> IMO, it's a good goal to have relative pointers in the metadata so that a
>>> row group doesn't depend on where it is in a file.
>>> It looks like some aspects of making the data updates more incremental
>>> could leverage Iceberg.
>>> 
>>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
>>> 
>>>> 
>>>> I have a contact at Hugging Face who actually notified me of the blog
>>>> post. I can transmit any questions or suggestions if desired.
>>>> 
>>>> Regards
>>>> 
>>>> Antoine.
>>>> 
>>>> 
>>>> On Wed, 9 Oct 2024 11:50:51 +0100
>>>> Steve Loughran
>>>> <st...@cloudera.com.INVALID> wrote:
>>>>> flatbuffer would be the obvious place, as there would be no
>>>>> compatibility issues with existing readers.
>>>>> 
>>>>> Also: that looks like a large amount of information to capture
>>>>> statistics on. Has anyone approached them yet?
>>>>> 
>>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <
>>>> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
>>>>> 
>>>>>> Thanks Antoine for sharing the blog post!
>>>>>> 
>>>>>> I skimmed it quickly and it seems that the main issue is the
>>>>>> absolute file offset used by metadata of page and column chunk.
>>>>>> It may take a long time to migrate if we want to replace them with
>>>>>> relative offsets in the current thrift definition. Perhaps it is a
>>>>>> good chance to improve this with the current flatbuffer experiment?
>>>>>> 
>>>>>> Best,
>>>>>> Gang
>>>>>> 
>>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <
>>>>>> antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> The Hugging Face developers published this insightful blog post
>>>>>>> about their attempts to deduplicate Parquet files when they have
>>>>>>> similar contents. They offer a couple suggestions for improvement
>>>>>>> at the end:
>>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
>>>>>>> 
>>>>>>> Regards
>>>>>>> 
>>>>>>> Antoine.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
