I apologize again for being dense, but I still don't understand how
changing Parquet ColumnChunks or RowGroups to use relative offsets in the
metadata structure would help deduplicate Parquet files.

Those metadata structures (and thus the offsets) are stored in the footer
of the file (at the end). All updated offsets would therefore be
concentrated in the footer, not in the actual data in the file.
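
For concreteness, here is a quick pyarrow sketch (illustrative only; the
file name is a placeholder) that prints the absolute offsets recorded in
the footer metadata:

    import pyarrow.parquet as pq

    # Any Parquet file will do; "data.parquet" is just a placeholder.
    md = pq.ParquetFile("data.parquet").metadata

    # Every one of these offsets comes from the footer's thrift
    # metadata, and each is an absolute position within the file.
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            cc = md.row_group(rg).column(col)
            print(rg, col, cc.file_offset, cc.data_page_offset,
                  cc.dictionary_page_offset)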

And indeed, the (very nice) red and green pictures in the blog[1] have a
red (changed) block at the end of each file, as expected.

The page headers[2] that appear before each page (inlined with the data)
don't appear to have offsets.

So maybe there is something else going on that causes the intermediate
contents of the files to differ?

Andrew

[1] https://huggingface.co/blog/improve_parquet_dedupe
[2] https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L713


On Thu, Oct 10, 2024 at 12:26 PM Yucheng Low <y...@huggingface.co> wrote:

> Yep! Definitely something we can look into. I have not thought about it
> too much yet... (Sounds tricky, but who knows.) I was talking to Kenny
> Daniel (hyparquet), and he noticed that the offsets in the Parquet format
> are signed integers. So an "in-place substitution" could be to replace
> them with negative integers to denote relative offsets...
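>
> For instance (a purely hypothetical sketch, not a concrete format
> proposal), a reader could interpret a stored offset like this:
>
>     def resolve_offset(stored: int, anchor: int) -> int:
>         # Non-negative values keep today's meaning: an absolute
>         # position in the file.
>         if stored >= 0:
>             return stored
>         # A negative value could denote a position relative to some
>         # anchor, e.g. the start of the enclosing row group.
>         return anchor + (-stored)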
>
> Anyway, in the meantime, there are a couple of ideas floating around here.
>
> 1: A content-based heuristic for row group chunking. Is this of interest?
> If so, I can work on plumbing it as an option into the Parquet writers
> (a rough sketch of the idea is below).
> 2: If there is anything else with relative offsets I can help with.
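>
> For 1, here is a minimal sketch of the kind of heuristic I have in mind
> (the function name and all parameters are made up for illustration):
>
>     def should_close_row_group(row_hash: int, rows_in_group: int,
>                                min_rows: int = 10_000,
>                                max_rows: int = 1_000_000,
>                                mask: int = (1 << 17) - 1) -> bool:
>         # Close the group when a hash of the row's content matches a
>         # fixed bit pattern, so boundaries track content rather than
>         # position and survive insertions/deletions elsewhere.
>         if rows_in_group < min_rows:
>             return False
>         if rows_in_group >= max_rows:
>             return True
>         return (row_hash & mask) == mask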
>
> Thanks,
> Yucheng
>
> > On Oct 10, 2024, at 2:58 AM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > (I see now this is probably what you mean by "implement a specialized
> > format pre-process to modify those offsets for storage purposes")
> >
> > On Thu, Oct 10, 2024 at 5:57 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> >>> It is not inherently aware of the Parquet file structure, and strives
> >>> to store things byte-for-byte identical so we don't have to deal with
> >>> parsers, malformed files, new formats, etc.
> >>
> >> I see -- thank you -- this is the key detail I didn't understand.
> >>
> >> I wonder if you could apply some normalization to the files prior to
> >> deduplicating them (aka could you update your hash calculation so it
> >> zeroed out the file offsets in a Parquet file before checking for
> >> equality)? That would require applying some special casing based on
> >> file format, but the code is likely relatively simple.
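> >>
> >> As a crude illustration (a sketch only, and it sidesteps zeroing
> >> individual fields by simply excluding the footer, which holds the
> >> offsets, from the hash):
> >>
> >>     import hashlib
> >>
> >>     def dedupe_key(path: str) -> str:
> >>         raw = open(path, "rb").read()
> >>         # A Parquet file ends with a 4-byte little-endian footer
> >>         # length followed by the 4-byte magic "PAR1".
> >>         footer_len = int.from_bytes(raw[-8:-4], "little")
> >>         # Hash only the bytes before the footer, since the footer
> >>         # holds the absolute offsets that shift between versions.
> >>         body = raw[: len(raw) - 8 - footer_len]
> >>         return hashlib.sha256(body).hexdigest()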
> >>
> >> On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <y...@huggingface.co> wrote:
> >>
> >>> Hi Andrew! Have not seen you in a while!
> >>>
> >>> Back on topic,
> >>>
> >>> The deduplication procedure we are using is file-type independent and
> >>> simply chunks the file into variable-sized chunks averaging ~64KB.
> >>> It is not inherently aware of the Parquet file structure, and strives
> >>> to store things byte-for-byte identical so we don't have to deal with
> >>> parsers, malformed files, new formats, etc.
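> >>>
> >>> Boundary selection looks roughly like this (a simplified sketch with a
> >>> toy rolling hash; the real implementation differs):
> >>>
> >>>     AVG_MASK = (1 << 16) - 1  # ~64KB expected chunk size
> >>>
> >>>     def chunk_boundaries(data: bytes):
> >>>         h, start = 0, 0
> >>>         for i, b in enumerate(data):
> >>>             h = ((h << 1) + b) & 0xFFFFFFFF  # toy rolling hash
> >>>             # Cut a chunk when the hash matches a fixed pattern,
> >>>             # so boundaries depend on content, not file position.
> >>>             if (h & AVG_MASK) == AVG_MASK:
> >>>                 yield (start, i + 1)
> >>>                 start, h = i + 1, 0
> >>>         if start < len(data):
> >>>             yield (start, len(data))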
> >>>
> >>> Also, we operate (like git) on a snapshot basis. We are not storing
> >>> information about how a file changed, as we do not have that
> >>> information, nor do we want to try to derive it. If we knew the
> >>> operations that changed the file, Iceberg would be the ideal solution,
> >>> I imagine. As such we need to try to identify common byte sequences
> >>> which already exist "somewhere" in our system and dedupe accordingly.
> >>> In a sense, what we are trying to do is orthogonal to Iceberg
> >>> (deltas vs. snapshots).
> >>>
> >>> However, the "file_offset" fields in RowGroup and ColumnChunk are not
> >>> position-independent, and so result in significant fragmentation and,
> >>> for files with small row groups, poor deduplication.
> >>>
> >>> We could of course implement a specialized format pre-process to
> >>> modify those offsets for storage purposes, but in my mind that is
> >>> probably remarkably difficult to make resilient when the goal is
> >>> byte-for-byte identical reconstruction.
> >>>
> >>> While we may just have to accept it for the current Parquet format (we
> >>> have some tricks to deal with fragmentation), if there are plans to
> >>> update the Parquet format, addressing the issue at the format layer
> >>> (switching all absolute offsets to relative) would be great and may
> >>> have future benefits here as well.
> >>>
> >>> Thanks,
> >>> Yucheng
> >>>
> >>>> On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >>>>
> >>>> I am sorry for the likely dumb question, but I think I am missing
> >>>> something.
> >>>>
> >>>> The blog post says "This means that any modification is likely to
> >>>> rewrite all the Column headers."
> >>>>
> >>>> But my understanding of the Parquet format is that the ColumnChunks[1]
> >>>> are stored inline with the RowGroups, which are stored in the footer.
> >>>>
> >>>> Thus I would expect that a Parquet deduplication process could copy
> >>>> the data for each row group memcpy-style, and write a new footer with
> >>>> updated offsets. This doesn't require rewriting the entire file, simply
> >>>> adjusting offsets and writing a new footer.
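> >>>>
> >>>> Schematically (a sketch that assumes thrift-generated objects from
> >>>> the parquet-format IDL, not working code against any particular
> >>>> library):
> >>>>
> >>>>     def shift_offsets(file_metadata, delta: int) -> None:
> >>>>         # Shift every absolute offset in the footer by the
> >>>>         # displacement of the copied row group data.
> >>>>         for rg in file_metadata.row_groups:
> >>>>             if rg.file_offset is not None:
> >>>>                 rg.file_offset += delta
> >>>>             for cc in rg.columns:
> >>>>                 cc.file_offset += delta
> >>>>                 md = cc.meta_data
> >>>>                 md.data_page_offset += delta
> >>>>                 if md.dictionary_page_offset is not None:
> >>>>                     md.dictionary_page_offset += delta
> >>>>                 if md.index_page_offset is not None:
> >>>>                     md.index_page_offset += delta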
> >>>>
> >>>> Andrew
> >>>>
> >>>>
> >>>> [1]
> >>>> https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
> >>>>
> >>>>
> >>>> On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am the author of the blog here!
> >>>>> Happy to answer any questions.
> >>>>>
> >>>>> There are a couple of parts: one is regarding relative pointers, and
> >>>>> a second is the row group chunking system (which for performance
> >>>>> purposes could benefit from being implemented in the C/C++ layer). I
> >>>>> am happy to help where I can with the latter, as that can be done
> >>>>> with the current Parquet version too.
> >>>>>
> >>>>> Thanks,
> >>>>> Yucheng
> >>>>>
> >>>>> On 2024/10/09 15:46:01 Julien Le Dem wrote:
> >>>>>> I recommended to them that they join the dev list. I think that's
> >>>>>> the easiest place to discuss this.
> >>>>>> IMO, it's a good goal to have relative pointers in the metadata so
> >>>>>> that a row group doesn't depend on where it is in a file.
> >>>>>> It looks like some aspects of making the data updates more
> >>>>>> incremental could leverage Iceberg.
> >>>>>>
> >>>>>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> I have a contact at Hugging Face who actually notified me of the
> >>>>>>> blog post. I can transmit any questions or suggestions if desired.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, 9 Oct 2024 11:50:51 +0100
> >>>>>>> Steve Loughran
> >>>>>>> <st...@cloudera.com.INVALID> wrote:
> >>>>>>>> flatbuffer would be the obvious place, as there would be no
> >>>>>>>> compatibility issues with existing readers.
> >>>>>>>>
> >>>>>>>> Also: that looks like a large amount of information to capture
> >>>>>>>> statistics on. Has anyone approached them yet?
> >>>>>>>>
> >>>>>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu
> >>>>>>>> <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks Antoine for sharing the blog post!
> >>>>>>>>>
> >>>>>>>>> I skimmed it quickly and it seems that the main issue is the
> >>>>>>>>> absolute file offsets used by the metadata of pages and column
> >>>>>>>>> chunks. It may take a long time to migrate if we want to replace
> >>>>>>>>> them with relative offsets in the current thrift definition.
> >>>>>>>>> Perhaps it is a good chance to improve this with the current
> >>>>>>>>> flatbuffer experiment?
> >>>>>>>>> this with the current flatbuffer experiment?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Gang
> >>>>>>>>>
> >>>>>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou
> >>>>>>>>> <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> The Hugging Face developers published this insightful blog post
> >>>>>>>>>> about their attempts to deduplicate Parquet files when they have
> >>>>>>>>>> similar contents. They offer a couple of suggestions for
> >>>>>>>>>> improvement at the end:
> >>>>>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>> Antoine.
