Re: Concatenation of parquet files

Pau Tallada Wed, 15 Sep 2021 01:55:38 -0700

Hi Weston,

Oh, I see... with those absolute pointers there is not much I can do,
definitely :(
Also, since the size in bytes varies between writes (I didn't expect that),
I'll have to parse everything and merge on the fly :)
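
As a rough illustration of the first step of that parsing: a Parquet file ends with the serialized footer, a 4-byte little-endian footer length, and the "PAR1" magic. The sketch below only locates the footer in an in-memory, unencrypted file; `locate_footer` and the synthetic bytes are hypothetical names made up for this example, and decoding the Thrift footer itself is not covered.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (footer_start, footer_length) for an unencrypted Parquet file in memory."""
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length, little-endian.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic stand-in for a real file: magic, 10 "row group" bytes,
# 5 "footer" bytes, the footer length, and the closing magic.
fake_body = b"\x00" * 10
fake_footer = b"\xff" * 5
fake_file = MAGIC + fake_body + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

footer_start, footer_len = locate_footer(fake_file)
```

Everything between byte 4 and `footer_start` is row-group data plus any index or bloom-filter structures written before the footer.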


Thanks a lot for your help,

Best regards,

Pau.

Message from Weston Pace <[email protected]> on Tue, 14 Sep 2021 at 22:48:

> A few things that I would expect to change in your experiment (from a
> cursory scan):
>
> RowGroup::Ordinal
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.RowGroup.html#structfield.ordinal)
> RowGroup::FileOffset
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.RowGroup.html#structfield.file_offset)
> ColumnChunk::FileOffset
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.ColumnChunk.html#structfield.file_offset)
> ColumnChunk::OffsetIndexOffset
> ColumnChunk::ColumnIndexOffset
>
> Basically a lot of the "pointers" in the metadata are absolute offsets
> from the start of the file.  Even though you keep resetting to write
> to the same bytes there is no way for the writer to know that so the
> offsets keep increasing.  Some of these you probably don't care about
> too much (they are different in your experiment but should not affect
> your goal since they won't change when appending).  However, something
> like ColumnChunk::FileOffset is (I think) the location of the column
> metadata (which is in the footer).  So if you relocate the footer then
> these offsets will need to be updated.
>
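
To make the point about absolute offsets concrete, here is a minimal sketch of what relocating a row group implies for its metadata. `ChunkOffsets` and `rebase` are hypothetical stand-ins invented for this example, not part of any Parquet library; the field names loosely mirror the Parquet metadata fields. It assumes the files carry no column indexes or bloom filters, which would need their own offsets adjusted.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class ChunkOffsets:
    """Hypothetical stand-in for the absolute offsets one column chunk carries."""
    file_offset: int
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


def rebase(chunks: List[ChunkOffsets], delta: int) -> List[ChunkOffsets]:
    """Shift every absolute offset by delta bytes; absent offsets stay absent."""
    moved = []
    for c in chunks:
        dict_off = c.dictionary_page_offset
        moved.append(replace(
            c,
            file_offset=c.file_offset + delta,
            data_page_offset=c.data_page_offset + delta,
            dictionary_page_offset=None if dict_off is None else dict_off + delta,
        ))
    return moved


# In file B the first row group starts right after the 4-byte magic.
# If file A contributes 1000 bytes of row-group data (magic and footer
# excluded), B's data lands 1000 bytes further into the combined stream.
b_chunks = [
    ChunkOffsets(file_offset=4, data_page_offset=4),
    ChunkOffsets(file_offset=60, data_page_offset=90, dictionary_page_offset=60),
]
moved = rebase(b_chunks, delta=1000)
```

The row-group bytes themselves can be copied unchanged; only the offsets recorded in the new footer need the shift.
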
> On Tue, Sep 14, 2021 at 3:01 AM Pau Tallada <[email protected]> wrote:
> >
> > Dear Gabor,
> >
> > Thanks a lot for the clarification! ☺
> > I understand this is not a common use case; I just had some hope that it
> > could be done easily :P
> >
> > If you are interested, I attach a Colab notebook that shows this
> > behaviour. The same data written three times produces different binary
> > contents.
> >
> > https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
> >
> > Thanks again and best regards,
> >
> > Pau
> >
> > Message from Gabor Szadovszky <[email protected]> on Tue, 14 Sep 2021 at 10:54:
> >
> > > Hi Pau,
> > >
> > > I guess attachments are not allowed on the Apache lists, so we cannot
> > > see the image.
> > >
> > > If the two row groups contain the very same data in the same order,
> > > encoded with the same encoding and compressed with the same codec, I
> > > think they should be the same binary. I am not sure why you have
> > > different binary streams for these row groups, but if the proper data
> > > can be decoded from both row groups I would not spend too much time on
> > > it.
> > >
> > > About merging row groups: it is a tough issue and far from as simple as
> > > concatenating the row groups (files) and creating a new footer. There
> > > are statistics in the footer that you have to take care of, as well as
> > > column indexes and bloom filters, which are part of neither the footer
> > > nor the row groups. (They are written in separate data structures before
> > > the footer.)
> > > If you don't want to decode the row groups, these statistics can be
> > > updated (with the new offsets) and the new footer can be created by
> > > reading the original footers only. The problem here is that creating
> > > such a parquet file is not very useful in most cases. Most of the
> > > problems come from many small row groups (in small files), which cannot
> > > be solved this way. To solve the small files problem we need to merge
> > > the row groups, and for that we need to decode the original data so we
> > > can re-create the statistics (at least for bloom filters).
> > >
> > > Long story short, theoretically it is solvable but it is a feature we
> > > haven't implemented properly so far.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a developer of cosmohub.pic.es, a science platform that provides
> > > > interactive analysis and exploration of large scientific datasets.
> > > > Working with Hive, users are able to generate the subset of data they
> > > > are interested in, and this result set is stored as a set of files.
> > > > When users want to download this dataset, we combine/concatenate all
> > > > the files on-the-fly to generate a single stream that gets downloaded.
> > > > Done right, this is very efficient, avoids materializing the combined
> > > > file and the stream is even seekable so downloads can be resumed. We
> > > > are able to do this for csv.bz2 and FITS formats.
> > > >
> > > > I am trying to do the same with parquet. Looking at the format
> > > > specification, it seems that it could be done by simply concatenating
> > > > the binary blobs of the set of row groups and generating a new footer
> > > > for the merged file. The problem is that the same data, written twice
> > > > in the same file (in two row groups), is represented with some
> > > > differences in the binary stream produced (see attached image). Why is
> > > > the binary representation of a row group different if the data is the
> > > > same? Is the order or position of a row group encoded inside its
> > > > metadata?
> > > >
> > > > I attach the image of a parquet file with the same data (a single
> > > > integer column named 'c' with a single value 0) written twice, with at
> > > > least two differences marked in red and blue.
> > > > [image: image.png]
> > > >
> > > >
> > > > A little diagram to show what I'm trying to accomplish:
> > > >
> > > > *contents of parquet file A:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > FOOTER A
> > > >
> > > > *contents of parquet file B:*
> > > > PAR1
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER B
> > > >
> > > > If I'm not mistaken, there is no metadata in each row group that
> > > > refers to its file or its position, so they should be relocatable. The
> > > > final file/stream would look like this:
> > > >
> > > > *contents of combined parquet file:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > NEW FOOTER A+B
> > > >
> > > > Thanks a lot in advance for the help understanding this,
> > > >
> > > > Best regards,
> > > >
> > > > Pau.
> > > > --
> > > > ----------------------------------
> > > > Pau Tallada Crespí
> > > > Departament de Serveis
> > > > Port d'Informació Científica (PIC)
> > > > Tel: +34 93 170 2729
> > > > ----------------------------------
> > > >
> > > >
> > >
> >
> >
> > --
> > ----------------------------------
> > Pau Tallada Crespí
> > Departament de Serveis
> > Port d'Informació Científica (PIC)
> > Tel: +34 93 170 2729
> > ----------------------------------
>


-- 
----------------------------------
Pau Tallada Crespí
Departament de Serveis
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
