(I see now this is probably what you mean by "implement a specialized format pre-process to modify those offsets for storage purposes")
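To make that concrete, here is a rough sketch of the kind of format-aware pre-process being discussed (this is not code we actually run, and the helper names are made up). Rather than truly zeroing the offsets, which would need a Thrift parser for the footer, it does the cruder thing of splitting the file at the footer boundary and hashing the page data and the footer separately, so the offset churn is confined to the footer's hash:

import hashlib
import struct

MAGIC = b"PAR1"

def split_footer(buf: bytes) -> tuple[bytes, bytes]:
    # Parquet trailer layout: <footer thrift> <4-byte LE footer length> "PAR1"
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", buf[-8:-4])[0]
    footer_start = len(buf) - 8 - footer_len
    # data region (magic + pages) vs. the footer that carries the absolute offsets
    return buf[:footer_start], buf[footer_start:len(buf) - 8]

def dedupe_keys(path: str) -> tuple[str, str]:
    # Hypothetical helper: separate content hashes for the page data and the
    # footer, so a footer-only change does not perturb the data region's hash.
    with open(path, "rb") as f:
        data, footer = split_footer(f.read())
    return hashlib.sha256(data).hexdigest(), hashlib.sha256(footer).hexdigest()

Even this small amount of format special-casing is the sort of thing we have been trying to avoid, per the resiliency concerns below.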
On Thu, Oct 10, 2024 at 5:57 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > It is not inherently aware of the Parquet file structure, and strives to store
> > things byte-for-byte identical so we don't have to deal with parsers, malformed
> > files, new formats, etc.
>
> I see -- thank you -- this is the key detail I didn't understand.
>
> I wonder if you could apply some normalization to the file prior to
> deduplicating them (aka could you update your hash calculation so it
> zeroed out the relative offsets in a Parquet file before checking for
> equality?) That would require applying some special case based on file
> format, but the code is likely relatively simple.
>
> On Wed, Oct 9, 2024 at 9:05 PM Yucheng Low <y...@huggingface.co> wrote:
>
>> Hi Andrew! Have not seen you in a while!
>>
>> Back on topic,
>>
>> The deduplication procedure we are using is file-type independent and
>> simply chunks the file into variable-sized chunks averaging ~64KB.
>> It is not inherently aware of the Parquet file structure, and strives to store
>> things byte-for-byte identical so we don't have to deal with parsers, malformed
>> files, new formats, etc.
>>
>> Also, we operate (like git) on a snapshot basis. We are not storing information
>> about how a file changed, as we do not have that information, nor do we want
>> to try to derive it. If we knew the operations that changed the file, Iceberg
>> would be the ideal solution, I imagine. As such we need to try to identify
>> common byte sequences which already exist "somewhere" in our system
>> and dedupe accordingly. In a sense, what we are trying to do is orthogonal
>> to Iceberg. (Deltas vs snapshots.)
>>
>> However, the "file_offset" fields in RowGroup and ColumnChunk are not
>> position-independent in the file and so result in significant fragmentation,
>> and for files with small row groups, poor deduplication.
>>
>> We could of course implement a specialized format pre-process to modify
>> those offsets for storage purposes, but in my mind that is probably remarkably
>> difficult to make resilient, given that the goal is byte-for-byte identical storage.
>>
>> While we may just have to accept it for the current Parquet format (we have
>> some tricks to deal with fragmentation), if there are plans on updating the
>> Parquet format, addressing the issue at the format layer (switching all absolute
>> offsets to relative) would be great and may have future benefits here as well.
>>
>> Thanks,
>> Yucheng
>>
>> > On Oct 9, 2024, at 5:38 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
>> >
>> > I am sorry for the likely dumb question, but I think I am missing something.
>> >
>> > The blog post says "This means that any modification is likely to rewrite
>> > all the Column headers."
>> >
>> > But my understanding of the Parquet format is that the ColumnChunks [1] are
>> > stored inline with the RowGroups which are stored in the footer.
>> >
>> > Thus I would expect that a Parquet deduplication process could copy the
>> > data for each row group memcpy-style, and write a new footer with updated
>> > offsets. This doesn't require rewriting the entire file, simply adjusting
>> > offsets and writing a new footer.
>> >
>> > Andrew
>> >
>> > [1]
>> > https://github.com/apache/parquet-format/blob/4f208158dba80ff4bff4afaa4441d7270103dff6/src/main/thrift/parquet.thrift#L918
>> >
>> > On Wed, Oct 9, 2024 at 1:51 PM Yucheng Low <y...@huggingface.co> wrote:
>> >
>> >> Hi,
>> >>
>> >> I am the author of the blog here!
>> >> Happy to answer any questions.
>> >>
>> >> There are a couple of parts: one is regarding relative pointers, and a
>> >> second is the row group chunking system (which for performance purposes
>> >> could benefit from being implemented in the C/C++ layer). I am happy to
>> >> help where I can with the latter, as that can be done with the current
>> >> Parquet version too.
>> >>
>> >> Thanks,
>> >> Yucheng
>> >>
>> >> On 2024/10/09 15:46:01 Julien Le Dem wrote:
>> >>> I recommended to them that they join the dev list. I think that's the
>> >>> easiest way to discuss.
>> >>> IMO, it's a good goal to have relative pointers in the metadata so that a
>> >>> row group doesn't depend on where it is in a file.
>> >>> It looks like some aspects of making the data updates more incremental
>> >>> could leverage Iceberg.
>> >>>
>> >>> On Wed, Oct 9, 2024 at 6:52 AM Antoine Pitrou <an...@python.org> wrote:
>> >>>>
>> >>>> I have a contact at Hugging Face who actually notified me of the blog
>> >>>> post. I can transmit any questions or suggestions if desired.
>> >>>>
>> >>>> Regards
>> >>>>
>> >>>> Antoine.
>> >>>>
>> >>>> On Wed, 9 Oct 2024 11:50:51 +0100 Steve Loughran
>> >>>> <st...@cloudera.com.INVALID> wrote:
>> >>>>> flatbuffer would be the obvious place, as there would be no compatibility
>> >>>>> issues with existing readers.
>> >>>>>
>> >>>>> Also: that looks like a large amount of information to capture statistics
>> >>>>> on. Has anyone approached them yet?
>> >>>>>
>> >>>>> On Wed, 9 Oct 2024 at 03:39, Gang Wu <
>> >>>>> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
>> >>>>>
>> >>>>>> Thanks Antoine for sharing the blog post!
>> >>>>>>
>> >>>>>> I skimmed it quickly and it seems that the main issue is the absolute
>> >>>>>> file offsets used by the page and column chunk metadata. It may take a
>> >>>>>> long time to migrate if we want to replace them with relative offsets
>> >>>>>> in the current thrift definition. Perhaps it is a good chance to improve
>> >>>>>> this with the current flatbuffer experiment?
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Gang
>> >>>>>>
>> >>>>>> On Wed, Oct 9, 2024 at 8:51 AM Antoine Pitrou <
>> >>>>>> antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
>> >>>>>>
>> >>>>>>> Hello,
>> >>>>>>>
>> >>>>>>> The Hugging Face developers published this insightful blog post about
>> >>>>>>> their attempts to deduplicate Parquet files when they have similar
>> >>>>>>> contents. They offer a couple of suggestions for improvement at the end:
>> >>>>>>> https://huggingface.co/blog/improve_parquet_dedupe
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>>
>> >>>>>>> Antoine.