Each row group should have its own statistics footer or dictionary. Your file structure should look like this:

> > *contents of parquet file A:*
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> >
> > *contents of parquet file B:*
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

Merged:

> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

I frequently concatenate smaller parquet files by appending row groups until I hit an optimal 125 meg file size for HDFS.

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
"We can similarly write a Parquet file with multiple row groups by using ParquetWriter"

-----Original Message-----
From: Pau Tallada <[email protected]>
Sent: Tuesday, September 14, 2021 6:01 AM
To: [email protected]
Subject: Re: Concatenation of parquet files

Dear Gabor,

Thanks a lot for the clarification! ☺
I understand this is not a common use case, I somewhat just had hope it could be done easily :P

If you are interested, I attach a Colab notebook that shows this behaviour. The same data written three times produces different binary contents.
https://urldefense.com/v3/__https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing__;!!KSjYCgUGsB4!Jzx-9D-0Fe2aFLJ5YPThSjNeLFi-BGs-mr0kmvpew1AC2er-i3m1NCRGGRyXqWt1evQ$

Thanks again and best regards,

Pau

On Tue, 14 Sep 2021 at 10:54, Gabor Szadovszky <[email protected]> wrote:

> Hi Pau,
>
> I guess attachments are not allowed in the apache lists so we cannot
> see the image.
>
> If the two row groups contain the very same data in the same order and
> encoded with the same encoding, compressed with the same codec I
> think, they should be the same binary. I am not sure why you have
> different binary streams for these row groups but if the proper data
> can be decoded from both row groups I would not spend too much time on it.
>
> About merging row groups.
> It is a tough issue and far from as simple as concatenating the row
> groups (files) and creating a new footer. There are statistics in the
> footer that you have to take care of, as well as column indexes and
> bloom filters that are part of neither the footer nor the row groups.
> (They are written in separate data structures before the footer.)
> If you don't want to decode the row groups, these statistics can be
> updated (with the new offsets) and the new footer can be created by
> reading the original footers only. The problem here is that creating
> such a parquet file is not very useful in most cases. Most of the
> problems come from many small row groups (in small files), which
> cannot be solved this way. To solve the small files problem we need to
> merge the row groups, and for that we need to decode the original data
> so we can re-create the statistics (at least for bloom filters).
>
> Long story short, theoretically it is solvable but it is a feature we
> haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <[email protected]> wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that
> > provides interactive analysis and exploration of large scientific
> > datasets. Working with Hive, users are able to generate the subset
> > of data they are interested in, and this result set is stored as a
> > set of files. When users want to download this dataset, we
> > combine/concatenate all the files on-the-fly to generate a single
> > stream that gets downloaded. Done right, this is very efficient,
> > avoids materializing the combined file and the stream is even
> > seekable so downloads can be resumed. We are able to do this for
> > csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet.
> > Looking at the format specification, it seems that it could be done
> > by simply concatenating the binary blobs of the set of row groups
> > and generating a new footer for the merged file. The problem is that
> > the same data, written twice in the same file (in two row groups),
> > is represented with some differences in the binary stream produced
> > (see attached image). Why is the binary representation of a row
> > group different if the data is the same? Is the order or position of
> > a row group codified inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single
> > integer column named 'c' with a single value 0) written twice, with
> > at least two differences marked in red and blue.
> > [image: image.png]
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > *contents of parquet file A:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > *contents of parquet file B:*
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row group that
> > refers to its file or its position, so they should be relocatable.
> > The final file/stream would look like this:
> >
> > *contents of combined parquet file:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > ROW GROUP B1
> > ROW GROUP B2
> > NEW FOOTER A+B
> >
> > Thanks a lot in advance for the help understanding this,
> >
> > Best regards,
> >
> > Pau.
> > --
> > ----------------------------------
> > Pau Tallada Crespí
> > Departament de Serveis
> > Port d'Informació Científica (PIC)
> > Tel: +34 93 170 2729
> > ----------------------------------

--
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
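
For reference, the ParquetWriter approach mentioned in the top reply of this thread could be sketched roughly as below. This is only an illustration, not code posted by anyone in the thread: the names plan_batches, merge_row_groups and TARGET_BYTES are made up for the example, and the 125 meg target is assumed to mean 125 MiB. Note that this route decodes each row group and writes it again, so it is not the byte-level concatenation Pau asked about, and input sizes are only a rough proxy for the size of the merged output.

```python
from typing import List, Sequence, Tuple

# Assumption: "125 meg" from the reply is interpreted as 125 MiB.
TARGET_BYTES = 125 * 1024 * 1024


def plan_batches(
    files: Sequence[Tuple[str, int]], target: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs so that each batch
    stays at or under `target`. A single file larger than `target`
    ends up in a batch of its own."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current batch if adding this file would overshoot.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def merge_row_groups(sources: Sequence[str], destination: str) -> None:
    """Copy every row group of every source file into one output file.

    Each row group is decoded into an Arrow table and re-encoded by
    ParquetWriter, so statistics are regenerated rather than reused.
    Assumes all source files share the same schema."""
    # Imported here so plan_batches stays usable without pyarrow.
    import pyarrow.parquet as pq

    writer = None
    try:
        for path in sources:
            parquet_file = pq.ParquetFile(path)
            for index in range(parquet_file.num_row_groups):
                table = parquet_file.read_row_group(index)
                if writer is None:
                    # Take the schema from the first row group read.
                    writer = pq.ParquetWriter(destination, table.schema)
                # Each call appends at least one row group to the output.
                writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```

A possible way to use it: build the (path, size) list with os.path.getsize, call plan_batches, then call merge_row_groups once per batch with a distinct output path.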
