Re: Concatenation of parquet files

Micah Kornfield Fri, 15 Oct 2021 13:40:25 -0700

Hi David,
I'm not sure I understand.  Concatenating files like this would likely
break things.  In particular in the example:



> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2


There should only be one footer per file; otherwise, I don't think there is
any means of discovering the A row groups.  Also, without rewriting the
metadata, the file offsets recorded for B would be wrong (
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790
).
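
To illustrate both points, here is a minimal sketch using only the Python
standard library. It assumes the usual unencrypted framing (leading PAR1,
then at the end: footer, 4-byte little-endian footer length, PAR1). The
helper names locate_footer and fake_file and the placeholder payloads are
made up for illustration; they are not real Parquet data or a real API.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic at both ends of an unencrypted Parquet file


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (start offset, length) of the footer a reader would find."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


def fake_file(row_groups: bytes, footer: bytes) -> bytes:
    """Hypothetical stand-in: Parquet framing around opaque payloads."""
    return MAGIC + row_groups + footer + struct.pack("<I", len(footer)) + MAGIC


file_a = fake_file(b"<A1><A2>", b"<FOOTER A>")
file_b = fake_file(b"<B1><B2>", b"<FOOTER B>")
merged = file_a + file_b  # naive byte-level concatenation

# A reader starts from the tail, so it only ever sees the last footer.
start, length = locate_footer(merged)
found = merged[start:start + length]

# Every offset stored in footer B was computed relative to the start of
# file B, so in the merged stream each one is off by the size of file A.
shift = len(file_a)
```

Here found is footer B only, so nothing points a reader at A1 or A2, and
the offsets inside footer B would need shift added before they are valid.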

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by using
> ParquetWriter"


Multiple row groups are fine.  Combining them after the fact by simple
file concatenation (which is what I understand the original question to be)
would yield incorrect results.  If you reread the small files and write them
out again in one pass, that would be fine.

Cheers,
Micah

On Fri, Oct 15, 2021 at 1:29 PM Lee, David <[email protected]>
wrote:

> Each row group should have its own statistics footer or dictionary. Your
> file structure should look like this:
>
> > > *contents of parquet file A:*
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > >
> > > *contents of parquet file B:*
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> I frequently concatenate smaller parquet files by appending rowgroups
> until I hit an optimal 125 meg file size for HDFS.
>
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by using
> ParquetWriter"
>
> -----Original Message-----
> From: Pau Tallada <[email protected]>
> Sent: Tuesday, September 14, 2021 6:01 AM
> To: [email protected]
> Subject: Re: Concatenation of parquet files
>
> Dear Gabor,
>
> Thanks a lot for the clarification! ☺
> I understand this is not a common use case; I just had some hope it
> could be done easily :P
>
> If you are interested, I attach a Colab notebook that shows this
> behaviour. The same data written three times produces different binary
> contents.
>
> https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
>
> Thanks again and best regards,
>
> Pau
>
> Message from Gabor Szadovszky <[email protected]> on Tue, 14 Sep 2021
> at 10:54:
>
> > Hi Pau,
> >
> > I guess attachments are not allowed in the apache lists so we cannot
> > see the image.
> >
> > If the two row groups contain the very same data in the same order,
> > encoded with the same encoding and compressed with the same codec, I
> > think they should be the same binary. I am not sure why you have
> > different binary streams for these row groups, but if the proper data
> > can be decoded from both row groups I would not spend too much time on
> > it.
> >
> > About merging row groups: it is a tough issue and far from as simple
> > as concatenating the row groups (files) and creating a new footer.
> > There are statistics in the footer that you have to take care of, as
> > well as column indexes and bloom filters that are part of neither the
> > footer nor the row groups. (They are written in separate data
> > structures before the footer.)
> > If you don't want to decode the row groups, these statistics can be
> > updated (with the new offsets) and the new footer can be created by
> > reading the original footers only. The problem here is that creating
> > such a parquet file is not very useful in most cases. Most of the
> > problems come from many small row groups (in small files), which
> > cannot be solved this way. To solve the small files problem we need to
> > merge the row groups, and for that we need to decode the original data
> > so we can re-create the statistics (at least for bloom filters).
> >
> > Long story short, theoretically it is solvable but it is a feature we
> > haven't implemented properly so far.
> >
> > Cheers,
> > Gabor
> >
> > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am a developer of cosmohub.pic.es, a science platform that
> > > provides interactive analysis and exploration of large scientific
> > > datasets. Working with Hive, users are able to generate the subset
> > > of data they are interested in, and this result set is stored as a
> > > set of files. When users want to download this dataset, we
> > > combine/concatenate all the files on-the-fly to generate a single
> > > stream that gets downloaded. Done right, this is very efficient,
> > > avoids materializing the combined file and the stream is even
> > > seekable so downloads can be resumed. We are able to do this for
> > > csv.bz2 and FITS formats.
> > >
> > > I am trying to do the same with parquet. Looking at the format
> > > specification, it seems that it could be done by simply
> > > concatenating the binary blobs of the set of row groups and
> > > generating a new footer for the merged file. The problem is that the
> > > same data, written twice in the same file (in two row groups), is
> > > represented with some differences in the binary stream produced (see
> > > attached image). Why is the binary representation of a row group
> > > different if the data is the same? Is the order or position of a row
> > > group codified inside its metadata?
> > >
> > > I attach the image of a parquet file with the same data (a single
> > > integer column named 'c' with a single value 0) written twice, with
> > > at least two differences marked in red and blue.
> > > [image: image.png]
> > >
> > >
> > > A little diagram to show what I'm trying to accomplish:
> > >
> > > *contents of parquet file A:*
> > > PAR1
> > > ROW GROUP A1
> > > ROW GROUP A2
> > > FOOTER A
> > >
> > > *contents of parquet file B:*
> > > PAR1
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > FOOTER B
> > >
> > > If I'm not mistaken, there is no metadata in each row group that
> > > refers to its file or its position, so they should be relocatable.
> > > The final file/stream would look like this:
> > >
> > > *contents of combined parquet file:*
> > > PAR1
> > > ROW GROUP A1
> > > ROW GROUP A2
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > NEW FOOTER A+B
> > >
> > > Thanks a lot in advance for the help understanding this,
> > >
> > > Best regards,
> > >
> > > Pau.
> > > --
> > > ----------------------------------
> > > Pau Tallada Crespí
> > > Departament de Serveis
> > > Port d'Informació Científica (PIC)
> > > Tel: +34 93 170 2729
> > > ----------------------------------
> > >
> > >
> >
>
>
> --
> ----------------------------------
> Pau Tallada Crespí
> Departament de Serveis
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> ----------------------------------
>
>
