> It is not inherently aware of the Parquet file structure, and strives to store things byte-for-byte identical so we don't have to deal with parsers, malformed files, new formats, etc.
I see -- thank you -- this is the key detail I didn't understand. I wonder if you could apply some normalization to the files prior to deduplicating them (i.e., could you update your hash calculation so that it zeroes out the absolute offsets in a Parquet file before checking for equality?). That would require applying a special case based on file format, but the code is likely relatively simple.
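A rough, untested sketch of the idea, using pyarrow's metadata API (the function name and hashing choices are purely illustrative, and it glosses over page indexes, bloom filters, and encrypted files):

import hashlib

import pyarrow.parquet as pq


def normalized_fingerprint(path):
    # Hash a Parquet file while ignoring the absolute offsets stored in the
    # footer, so two files whose row groups merely sit at different byte
    # positions still hash the same.
    md = pq.ParquetFile(path).metadata
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for rg in range(md.num_row_groups):
            for col in range(md.num_columns):
                cc = md.row_group(rg).column(col)
                # The column chunk's pages start at the dictionary page if
                # there is one, otherwise at the first data page.
                start = cc.data_page_offset
                if cc.has_dictionary_page:
                    start = min(start, cc.dictionary_page_offset)
                f.seek(start)
                # The page bytes themselves are position-independent.
                h.update(f.read(cc.total_compressed_size))
                # Fold in offset-free metadata, but none of the file offsets.
                h.update(cc.path_in_schema.encode())
                h.update(cc.compression.encode())
    return h.hexdigest()

Two files whose row groups hold identical bytes but sit at different offsets would then compare equal even though their footers differ.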
On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <y...@huggingface.co> wrote:

> Hi Andrew! Have not seen you in a while!
>
> Back on topic,
>
> The deduplication procedure we are using is file-type independent and simply chunks the file into variable-sized chunks averaging ~64KB. It is not inherently aware of the Parquet file structure, and strives to store things byte-for-byte identical so we don't have to deal with parsers, malformed files, new formats, etc.
>
> Also, we operate (like git) on a snapshot basis. We are not storing information about how a file changed, as we do not have that information, nor do we want to try to derive it. If we knew the operations that changed the file, Iceberg would be the ideal solution, I imagine. As such we need to try to identify common byte sequences which already exist "somewhere" in our system and dedupe accordingly. In a sense, what we are trying to do is orthogonal to Iceberg (deltas vs. snapshots).
>
> However, the "file_offset" fields in RowGroup and ColumnChunk are not position independent in the file and so result in significant fragmentation, and for files with small row groups, poor deduplication.
>
> We could of course implement a specialized format pre-process to modify those offsets for storage purposes, but in my mind that is probably remarkably difficult to make resilient, where the goal is byte-for-byte identical.
>
> While we may just have to accept it for the current Parquet format (we have some tricks to deal with fragmentation), if there are plans on updating the Parquet format, addressing the issue at the format layer (switching all absolute offsets to relative) would be great and may have future benefits here as well.
>
> Thanks,
> Yucheng
>
>
> On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > I am sorry for the likely dumb question, but I think I am missing something.
> >
> > The blog post says "This means that any modification is likely to rewrite all the Column headers."
> >
> > But my understanding of the parquet format is that the ColumnChunks[1] are stored inline with the RowGroups which are stored in the footer.
> >
> > Thus I would expect that a parquet deduplication process could copy the data for each row group memcpy style, and write a new footer with updated offsets. This doesn't require rewriting the entire file, simply adjusting offsets and writing a new footer.
> >
> > Andrew
> >
> > [1] https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
> >
> > On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
> >
> >> Hi,
> >>
> >> I am the author of the blog here!
> >> Happy to answer any questions.
> >>
> >> There are a couple of parts: one is regarding relative pointers, and a second is the row group chunking system (which for performance purposes could benefit from being implemented in the C/C++ layer). I am happy to help where I can with the latter, as that can be done with the current Parquet version too.
> >>
> >> Thanks,
> >> Yucheng
> >>
> >> On 2024/10/09 15:46:01 Julien Le Dem wrote:
> >>> I recommended to them that they join the dev list. I think that's the easiest way to discuss.
> >>> IMO, it's a good goal to have relative pointers in the metadata so that a row group doesn't depend on where it is in a file.
> >>> It looks like some aspects of making the data updates more incremental could leverage Iceberg.
> >>>
> >>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
> >>>>
> >>>> I have a contact at Hugging Face who actually notified me of the blog post. I can transmit any questions or suggestions if desired.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>> On Wed, 9 Oct 2024 11:50:51 +0100 Steve Loughran <st...@cloudera.com.INVALID> wrote:
> >>>>> flatbuffer would be the obvious place, as there would be no compatibility issues with existing readers.
> >>>>>
> >>>>> Also: that looks like a large amount of information to capture statistics on. Has anyone approached them yet?
> >>>>>
> >>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>>>
> >>>>>> Thanks Antoine for sharing the blog post!
> >>>>>>
> >>>>>> I skimmed it quickly and it seems that the main issue is the absolute file offset used by the metadata of pages and column chunks. It may take a long time to migrate if we want to replace them with relative offsets in the current thrift definition. Perhaps it is a good chance to improve this with the current flatbuffer experiment?
> >>>>>>
> >>>>>> Best,
> >>>>>> Gang
> >>>>>>
> >>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> The Hugging Face developers published this insightful blog post about their attempts to deduplicate Parquet files when they have similar contents. They offer a couple suggestions for improvement at the end:
> >>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
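As a side note, the chunking Yucheng describes above (file-type independent, variable-sized chunks averaging ~64KB) is content-defined chunking: boundaries are derived from the bytes themselves via a rolling hash, so an insertion early in a file only disturbs the chunks around it. A toy sketch follows; it is purely illustrative and not the actual Hugging Face implementation (the 64 KB target, window, and gear table are all assumptions):

import hashlib

# Per-byte "gear" values give the rolling hash reasonable mixing.
GEAR = [int.from_bytes(hashlib.blake2b(bytes([i]), digest_size=8).digest(), "big")
        for i in range(256)]


def content_defined_chunks(data, min_size=16 * 1024, avg_size=64 * 1024, max_size=256 * 1024):
    # Split data into variable-sized chunks whose boundaries depend only on
    # nearby byte content, so identical regions chunk identically even when
    # shifted by an insertion elsewhere in the file.
    mask = avg_size - 1  # low-bits-zero cut test; assumes avg_size is a power of two
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# A dedupe store then keeps each chunk once, keyed by its digest, e.g.:
# store = {hashlib.sha256(c).hexdigest(): c for c in content_defined_chunks(blob)}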