RE: Concatenation of parquet files

Lee, David Fri, 15 Oct 2021 13:29:34 -0700

Each row group should have its own statistics footer or dictionary. Your file 
structure should look like this:


> > *contents of parquet file A:*
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> >
> > *contents of parquet file B:*
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

Merged:
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

I frequently concatenate smaller parquet files by appending row groups until I 
hit an optimal file size of about 125 MB for HDFS.

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
"We can similarly write a Parquet file with multiple row groups by using 
ParquetWriter"

-----Original Message-----
From: Pau Tallada <[email protected]> 
Sent: Tuesday, September 14, 2021 6:01 AM
To: [email protected]
Subject: Re: Concatenation of parquet files

Dear Gabor,

Thanks a lot for the clarification! ☺
I understand this is not a common use case; I was just hoping it could 
be done easily :P

If you are interested, here is a Colab notebook that shows this 
behaviour. The same data written three times produces different binary contents.
https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing

Thanks again and best regards,

Pau

On Tue, 14 Sep 2021 at 10:54, Gabor Szadovszky <[email protected]> wrote:

> Hi Pau,
>
> I guess attachments are not allowed in the apache lists so we cannot 
> see the image.
>
> If the two row groups contain the very same data in the same order, 
> encoded with the same encoding and compressed with the same codec, I 
> think they should be binary identical. I am not sure why you have 
> different binary streams for these row groups, but if the proper data 
> can be decoded from both row groups I would not spend too much time on it.
>
> About merging row groups: it is a tough issue, and far from as simple 
> as concatenating the row groups (files) and creating a new footer. 
> There are statistics in the footer that you have to take care of, as 
> well as column indexes and bloom filters, which are part of neither 
> the footer nor the row groups. (They are written in separate data 
> structures before the footer.)
> If you don't want to decode the row groups, these statistics can be 
> updated (with the new offsets), and the new footer can be created by 
> reading only the original footers. The problem is that creating such 
> a parquet file is not very useful in most cases. Most of the problems 
> come from many small row groups (in small files), which cannot be 
> solved this way. To solve the small files problem we need to merge 
> the row groups, and for that we need to decode the original data so 
> we can re-create the statistics (at least for bloom filters).
>
> Long story short, theoretically it is solvable but it is a feature we 
> haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <[email protected]> wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that 
> > provides interactive analysis and exploration of large scientific 
> > datasets. Working with Hive, users are able to generate the subset 
> > of data they are interested in, and this result set is stored as a 
> > set of files. When users want to download this dataset, we 
> > combine/concatenate all the files on-the-fly to generate a single 
> > stream that gets downloaded. Done right, this is very efficient, 
> > avoids materializing the combined file and the stream is even 
> > seekable so downloads can be resumed. We are able to do this for 
> > csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet. Looking at the format 
> > specification, it seems that it could be done by simply 
> > concatenating the binary blobs of the set of row groups and 
> > generating a new footer for the merged file. The problem is that the 
> > same data, written twice in the same file (in two row groups), is 
> > represented with some differences in the binary stream produced (see 
> > attached image). Why is the binary representation of a row group 
> > different if the data is the same? Is the order or position of a row group 
> > codified inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single 
> > integer column named 'c' with a single value 0) written twice, with 
> > at least two differences marked in red and blue.
> > [image: image.png]
> >
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > *contents of parquet file A:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > *contents of parquet file B:*
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row group that 
> > refers to its file or its position, so they should be relocatable. 
> > The final file/stream would look like this:
> >
> > *contents of combined parquet file:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > ROW GROUP B1
> > ROW GROUP B2
> > NEW FOOTER A+B
> >
> > Thanks a lot in advance for the help understanding this,
> >
> > Best regards,
> >
> > Pau.
> > --
> > ----------------------------------
> > Pau Tallada Crespí
> > Departament de Serveis
> > Port d'Informació Científica (PIC)
> > Tel: +34 93 170 2729
> > ----------------------------------
> >
> >
>


--
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------

