> It is not inherently aware of the Parquet file structure, and we strive
> to store things byte-for-byte identical so we don't have to deal with
> parsers, malformed files, new formats, etc.

I see -- thank you -- this is the key detail I didn't understand.

I wonder if you could apply some normalization to the files prior to
deduplicating them (i.e. could you update your hash calculation so that it
zeroed out the absolute file offsets in a Parquet file before checking for
equality?). That would require some special-casing based on file format,
but the code is likely relatively simple.
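
For illustration, a rough Python sketch of that normalize-before-hashing
idea (the `offset_field_ranges` and `chunk_boundaries` arguments are
hypothetical placeholders standing in for a format-aware scan of the footer
metadata and for the existing content-defined chunker, respectively):

    import hashlib
    from typing import Iterable, List, Tuple

    def normalized_chunk_hashes(
        data: bytes,
        offset_field_ranges: Iterable[Tuple[int, int]],
        chunk_boundaries: List[int],
    ) -> List[str]:
        """Hash chunks of `data` with position-dependent offset bytes zeroed.

        The original bytes are still what gets stored; the masked copy is
        only used to decide which chunks are duplicates of each other.
        """
        masked = bytearray(data)
        for start, end in offset_field_ranges:
            # Zero the bytes holding absolute offsets (e.g. file_offset
            # fields), so two files differing only in where a row group
            # sits still produce identical chunk hashes.
            masked[start:end] = bytes(end - start)

        return [
            hashlib.sha256(bytes(masked[start:end])).hexdigest()
            for start, end in zip(chunk_boundaries[:-1], chunk_boundaries[1:])
        ]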




On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <y...@huggingface.co> wrote:

> Hi Andrew! Have not seen you in a while!
>
> Back on topic,
>
> The deduplication procedure we are using is file-type independent and
> simply chunks the file into variable-sized chunks averaging ~64 KB.
> It is not inherently aware of the Parquet file structure, and we strive
> to store things byte-for-byte identical so we don't have to deal with
> parsers, malformed files, new formats, etc.
>
> Also, we operate (like git) on a snapshot basis. We are not storing
> information about how a file changed, as we do not have that information,
> nor do we want to try to derive it. If we knew the operations that changed
> the file, Iceberg would be the ideal solution, I imagine. As such we need
> to try to identify common byte sequences which already exist "somewhere"
> in our system and dedupe accordingly. In a sense, what we are trying to do
> is orthogonal to Iceberg (deltas vs. snapshots).
>
> However, the "file_offset" fields in RowGroup and ColumnChunk are not
> position-independent in the file, and so they result in significant
> fragmentation and, for files with small row groups, poor deduplication.
>
> We could of course implement a specialized, format-aware pre-process to
> modify those offsets for storage purposes, but in my mind that is probably
> remarkably difficult to make resilient when the goal is byte-for-byte
> identical reconstruction.
>
> While we may just have to accept it for the current Parquet format (we
> have some tricks to deal with fragmentation), if there are plans to update
> the Parquet format, addressing the issue at the format layer (switching
> all absolute offsets to relative) would be great and may have future
> benefits here as well.
>
> Thanks,
> Yucheng
>
> > On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > I am sorry for the likely dumb question, but I think I am missing
> > something.
> >
> > The blog post says "This means that any modification is likely to
> > rewrite all the Column headers."
> >
> > But my understanding of the Parquet format is that the ColumnChunks[1]
> > are stored inline with the RowGroups, which are stored in the footer.
> >
> > Thus I would expect that a Parquet deduplication process could copy the
> > data for each row group memcpy-style, and write a new footer with
> > updated offsets. This doesn't require rewriting the entire file, simply
> > adjusting offsets and writing a new footer.
> >
> > Andrew
> >
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
> >
> >
> > On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
> >
> >> Hi,
> >>
> >> I am the author of the blog here!
> >> Happy to answer any questions.
> >>
> >> There are a couple of parts: one is regarding relative pointers, and a
> >> second is the row group chunking system (which, for performance
> >> purposes, could benefit from being implemented in the C/C++ layer). I am
> >> happy to help where I can with the latter, as that can be done with the
> >> current Parquet version too.
> >>
> >> Thanks,
> >> Yucheng
> >>
> >> On 2024/10/09 15:46:01 Julien Le Dem wrote:
> >>> I recommended to them that they join the dev list. I think that's the
> >>> easiest way to discuss this.
> >>> IMO, it's a good goal to have relative pointers in the metadata so
> >>> that a row group doesn't depend on where it is in a file.
> >>> It looks like some aspects of making the data updates more incremental
> >>> could leverage Iceberg.
> >>>
> >>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
> >>>
> >>>>
> >>>> I have a contact at Hugging Face who actually notified me of the blog
> >>>> post. I can transmit any questions or suggestions if desired.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On Wed, 9 Oct 2024 11:50:51 +0100
> >>>> Steve Loughran
> >>>> <st...@cloudera.com.INVALID> wrote:
> >>>>> flatbuffer would be the obvious place, as there would be no
> >>>>> compatibility issues with existing readers.
> >>>>>
> >>>>> Also: that looks like a large amount of information to capture
> >>>>> statistics on. Has anyone approached them yet?
> >>>>>
> >>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <
> >>>> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>>>
> >>>>>> Thanks Antoine for sharing the blog post!
> >>>>>>
> >>>>>> I skimmed it quickly and it seems that the main issue is the
> >>>>>> absolute file offsets used by the metadata of pages and column
> >>>>>> chunks. It may take a long time to migrate if we want to replace
> >>>>>> them with relative offsets in the current thrift definition.
> >>>>>> Perhaps it is a good chance to improve this with the current
> >>>>>> flatbuffer experiment?
> >>>>>>
> >>>>>> Best,
> >>>>>> Gang
> >>>>>>
> >>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <
> >>>> antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> The Hugging Face developers published this insightful blog post
> >>>>>>> about their attempts to deduplicate Parquet files when they have
> >>>>>>> similar contents. They offer a couple of suggestions for
> >>>>>>> improvement at the end:
> >>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>
>
