(I see now that this is probably what you meant by "implement a specialized
format pre-process to modify those offsets for storage purposes".)

On Thu, Oct 10, 2024 at 5:57 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> > It is not inherently aware of the Parquet file structure, and strives to
> > store things byte-for-byte identical so we don't have to deal with
> > parsers, malformed files, new formats, etc.
>
> I see -- thank you -- this is the key detail I didn't understand.
>
> I wonder if you could apply some normalization to the files prior to
> deduplicating them (i.e., could you update your hash calculation so it
> zeroes out the file offsets in a Parquet file before checking for
> equality)? That would require applying some special-casing based on file
> format, but the code is likely relatively simple.
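>
> Roughly what I have in mind, as an untested sketch using pyarrow and a
> hypothetical local path (a real version would need more care):
>
>     import hashlib
>     import pyarrow.parquet as pq
>
>     def content_fingerprint(path: str) -> str:
>         """Hash the schema and the raw column-chunk bytes, but not the
>         positions recorded in the footer, so two files whose chunks have
>         merely moved still compare equal."""
>         md = pq.ParquetFile(path).metadata
>         h = hashlib.sha256()
>         h.update(str(md.schema).encode())          # schema, no offsets
>         with open(path, "rb") as f:
>             for rg in range(md.num_row_groups):
>                 for col in range(md.num_columns):
>                     cc = md.row_group(rg).column(col)
>                     start = cc.dictionary_page_offset or cc.data_page_offset
>                     f.seek(start)
>                     h.update(f.read(cc.total_compressed_size))
>         return h.hexdigest()
>
>     print(content_fingerprint("example.parquet"))  # hypothetical file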
>
>
>
>
> On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <y...@huggingface.co> wrote:
>
>> Hi Andrew! Have not seen you in a while!
>>
>> Back on topic,
>>
>> The deduplication procedure we are using is file-type independent and
>> simply chunks each file into variable-sized chunks averaging ~64 KB.
>> It is not inherently aware of the Parquet file structure, and strives to
>> store things byte-for-byte identical so we don't have to deal with
>> parsers, malformed files, new formats, etc.
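>>
>> A toy sketch of the general idea (illustrative only, not our actual
>> production scheme; chunk sizes come out around 64 KB on average):
>>
>>     def chunk_boundaries(data: bytes, mask: int = (1 << 16) - 1) -> list[int]:
>>         """Content-defined chunking: cut wherever a rolling hash has its
>>         low 16 bits equal to zero (~1-in-64K chance per byte), so cut
>>         points depend on content rather than position."""
>>         boundaries, h = [], 0
>>         for i, b in enumerate(data):
>>             h = ((h << 1) + b) & 0xFFFFFFFF   # old bytes shift out of state
>>             if (h & mask) == 0:
>>                 boundaries.append(i + 1)
>>                 h = 0
>>         if not boundaries or boundaries[-1] != len(data):
>>             boundaries.append(len(data))
>>         return boundaries
>>
>> Insertions or deletions then only perturb the chunks around the edit,
>> leaving identical chunks elsewhere to deduplicate against content we
>> already store.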
>>
>> Also, we operate (like git) on a snapshot basis. We are not storing
>> information about how a file changed, as we do not have that information,
>> nor do we want to try to derive it. If we knew the operations that changed
>> the file, I imagine Iceberg would be the ideal solution. As such, we need
>> to identify common byte sequences which already exist "somewhere" in our
>> system and dedupe accordingly. In a sense, what we are trying to do is
>> orthogonal to Iceberg (deltas vs. snapshots).
>>
>> However, the "file_offset" fields in RowGroup and ColumnChunk are not
>> position-independent, and so they result in significant fragmentation and,
>> for files with small row groups, poor deduplication.
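>>
>> To make this concrete, here is a small sketch (pyarrow, hypothetical file
>> path) that prints the absolute positions the footer records for each
>> column chunk:
>>
>>     import pyarrow.parquet as pq
>>
>>     md = pq.ParquetFile("example.parquet").metadata   # hypothetical path
>>     for rg in range(md.num_row_groups):
>>         for col in range(md.num_columns):
>>             cc = md.row_group(rg).column(col)
>>             print(rg, col, cc.file_offset, cc.data_page_offset,
>>                   cc.dictionary_page_offset)
>>
>> Every value printed is an absolute byte position, so anything that grows
>> or shrinks earlier in the file changes all of them, and the rewritten
>> footer bytes no longer dedupe against what we already have stored.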
>>
>> We could of course implement a specialized format pre-process to modify
>> those offsets for storage purposes, but in my mind that is probably
>> remarkably difficult to make resilient when the goal is byte-for-byte
>> identical reconstruction.
>>
>> While we may just have to accept this for the current Parquet format (we
>> have some tricks to deal with fragmentation), if there are plans to update
>> the format, addressing the issue at the format layer (switching all
>> absolute offsets to relative ones) would be great and may have future
>> benefits here as well.
>>
>> Thanks,
>> Yucheng
>>
>> > On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
>> >
>> > I am sorry for the likely dumb question, but I think I am missing
>> > something.
>> >
>> > The blog post says "This means that any modification is likely to
>> > rewrite all the Column headers."
>> >
>> > But my understanding of the Parquet format is that the ColumnChunks [1]
>> > are stored inline with the RowGroups, which are stored in the footer.
>> >
>> > Thus I would expect that a Parquet deduplication process could copy the
>> > data for each row group memcpy-style and write a new footer with updated
>> > offsets. This doesn't require rewriting the entire file, simply adjusting
>> > offsets and writing a new footer.
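>> >
>> > A quick sketch of why that should work (untested, hypothetical path): the
>> > footer sits at the very end of the file, so it can be replaced without
>> > touching the row-group bytes that precede it.
>> >
>> >     import struct
>> >
>> >     with open("example.parquet", "rb") as f:   # hypothetical path
>> >         size = f.seek(0, 2)                    # file size
>> >         f.seek(-8, 2)
>> >         footer_len_bytes, magic = f.read(4), f.read(4)
>> >         assert magic == b"PAR1"                # Parquet trailing magic
>> >         footer_len = struct.unpack("<I", footer_len_bytes)[0]
>> >         print("Thrift FileMetaData starts at byte", size - 8 - footer_len)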
>> >
>> > Andrew
>> >
>> >
>> > [1]
>> > https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
>> >
>> >
>> > On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
>> >
>> >> Hi,
>> >>
>> >> I am the author of the blog here!
>> >> Happy to answer any questions.
>> >>
>> >> There are a couple of parts: one is regarding relative pointers, and a
>> >> second is the row group chunking system (which for performance purposes
>> >> could benefit from being implemented in the C/C++ layer). I am happy to
>> >> help where I can with the latter, as that can be done with the current
>> >> Parquet version too.
>> >>
>> >> Thanks,
>> >> Yucheng
>> >>
>> >> On 2024/10/09 15:46:01 Julien Le Dem wrote:
>> >>> I recommended that they join the dev list. I think that's the easiest
>> >>> place to discuss this.
>> >>> IMO, it's a good goal to have relative pointers in the metadata so that
>> >>> a row group doesn't depend on where it is in a file.
>> >>> It looks like some aspects of making the data updates more incremental
>> >>> could leverage Iceberg.
>> >>>
>> >>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
>> >>>
>> >>>>
>> >>>> I have a contact at Hugging Face who actually notified me of the blog
>> >>>> post. I can transmit any questions or suggestions if desired.
>> >>>>
>> >>>> Regards
>> >>>>
>> >>>> Antoine.
>> >>>>
>> >>>>
>> >>>> On Wed, 9 Oct 2024 11:50:51 +0100
>> >>>> Steve Loughran
>> >>>> <st...@cloudera.com.INVALID> wrote:
>> >>>>> Flatbuffers would be the obvious place, since there would be no
>> >>>>> compatibility issues with existing readers.
>> >>>>>
>> >>>>> Also: that looks like a large amount of information to capture
>> >>>>> statistics on. Has anyone approached them yet?
>> >>>>>
>> >>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
>> >>>>>
>> >>>>>> Thanks Antoine for sharing the blog post!
>> >>>>>>
>> >>>>>> I skimmed it quickly and it seems that the main issue is the
>> >>>>>> absolute file offsets used by the page and column chunk metadata.
>> >>>>>> It may take a long time to migrate if we want to replace them with
>> >>>>>> relative offsets in the current Thrift definition. Perhaps it is a
>> >>>>>> good chance to improve this with the current flatbuffer experiment?
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Gang
>> >>>>>>
>> >>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Hello,
>> >>>>>>>
>> >>>>>>> The Hugging Face developers published this insightful blog post
>> >>>>>>> about their attempts to deduplicate Parquet files when they have
>> >>>>>>> similar contents. They offer a couple of suggestions for improvement
>> >>>>>>> at the end:
>> >>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>>
>> >>>>>>> Antoine.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>>
>>
