Hi Weston,

Oh, I see... with those absolute pointers there is definitely not much I can do :(
Also, since the size in bytes varies between writes (I didn't expect that), I'll have to parse everything and merge on the fly :)

Thanks a lot for your help,

Best regards,

Pau.

Message from Weston Pace <[email protected]> on Tue, 14 Sep 2021 at 22:48:

> A few things that will be expected to change in your experiment (off a
> cursory scan):
>
> RowGroup::Ordinal
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.RowGroup.html#structfield.ordinal)
> RowGroup::FileOffset
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.RowGroup.html#structfield.file_offset)
> ColumnChunk::FileOffset
> (https://docs.rs/parquet-format/4.0.0/parquet_format/struct.ColumnChunk.html#structfield.file_offset)
> ColumnChunk::OffsetIndexOffset
> ColumnChunk::ColumnIndexOffset
>
> Basically a lot of the "pointers" in the metadata are absolute offsets
> from the start of the file. Even though you keep resetting to write
> to the same bytes there is no way for the writer to know that so the
> offsets keep increasing. Some of these you probably don't care about
> too much (they are different in your experiment but should not affect
> your goal since they won't change when appending). However, something
> like ColumnChunk::FileOffset is (I think) the location of the column
> metadata (which is in the footer). So if you relocate the footer then
> these offsets will need to be updated.
>
> On Tue, Sep 14, 2021 at 3:01 AM Pau Tallada <[email protected]> wrote:
> >
> > Dear Gabor,
> >
> > Thanks a lot for the clarification! ☺
> > I understand this is not a common use case, I just somewhat hoped it
> > could be done easily :P
> >
> > If you are interested, I attach a Colab notebook that shows this
> > behaviour. The same data written three times produces different binary
> > contents.
> >
> > https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
> >
> > Thanks again and best regards,
> >
> > Pau
> >
> > Message from Gabor Szadovszky <[email protected]> on Tue, 14 Sep
> > 2021 at 10:54:
> >
> > > Hi Pau,
> > >
> > > I guess attachments are not allowed in the apache lists so we cannot see
> > > the image.
> > >
> > > If the two row groups contain the very same data in the same order,
> > > encoded with the same encoding and compressed with the same codec, I think
> > > they should be the same binary. I am not sure why you have different binary
> > > streams for these row groups but if the proper data can be decoded from
> > > both row groups I would not spend too much time on it.
> > >
> > > About merging row groups. It is a tough issue and far from as simple as
> > > concatenating the row groups (files) and creating a new footer. There are
> > > statistics in the footer that you have to take care of, as well as column
> > > indexes and bloom filters that are part of neither the footer nor the
> > > row groups. (They are written in separate data structures before the
> > > footer.)
> > > If you don't want to decode the row groups, these statistics can be updated
> > > (with the new offsets) and the new footer can be created by reading
> > > only the original footers. The problem here is that creating such a parquet
> > > file is not very useful in most cases. Most of the problems come from many
> > > small row groups (in small files), which cannot be solved this way. To solve
> > > the small files problem we need to merge the row groups, and for that we
> > > need to decode the original data so we can re-create the statistics (at
> > > least for bloom filters).
> > >
> > > Long story short, theoretically it is solvable but it is a feature we
> > > haven't implemented properly so far.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a developer of cosmohub.pic.es, a science platform that provides
> > > > interactive analysis and exploration of large scientific datasets.
> > > > Working with Hive, users are able to generate the subset of data they are
> > > > interested in, and this result set is stored as a set of files. When users
> > > > want to download this dataset, we combine/concatenate all the files
> > > > on-the-fly to generate a single stream that gets downloaded. Done right,
> > > > this is very efficient, avoids materializing the combined file, and the
> > > > stream is even seekable so downloads can be resumed. We are able to do this
> > > > for csv.bz2 and FITS formats.
> > > >
> > > > I am trying to do the same with parquet. Looking at the format
> > > > specification, it seems that it could be done by simply concatenating the
> > > > binary blobs of the set of row groups and generating a new footer for the
> > > > merged file. The problem is that the same data, written twice in the same
> > > > file (in two row groups), is represented with some differences in the
> > > > binary stream produced (see attached image). Why is the binary
> > > > representation of a row group different if the data is the same? Is the
> > > > order or position of a row group encoded inside its metadata?
> > > >
> > > > I attach the image of a parquet file with the same data (a single integer
> > > > column named 'c' with a single value 0) written twice, with at least two
> > > > differences marked in red and blue.
> > > > [image: image.png]
> > > >
> > > > A little diagram to show what I'm trying to accomplish:
> > > >
> > > > *contents of parquet file A:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > FOOTER A
> > > >
> > > > *contents of parquet file B:*
> > > > PAR1
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > FOOTER B
> > > >
> > > > If I'm not mistaken, there is no metadata in each row group that refers to
> > > > its file or its position, so they should be relocatable.
> > > > The final file/stream would look like this:
> > > >
> > > > *contents of combined parquet file:*
> > > > PAR1
> > > > ROW GROUP A1
> > > > ROW GROUP A2
> > > > ROW GROUP B1
> > > > ROW GROUP B2
> > > > NEW FOOTER A+B
> > > >
> > > > Thanks a lot in advance for the help understanding this,
> > > >
> > > > Best regards,
> > > >
> > > > Pau.
> > > > --
> > > > ----------------------------------
> > > > Pau Tallada Crespí
> > > > Departament de Serveis
> > > > Port d'Informació Científica (PIC)
> > > > Tel: +34 93 170 2729
> > > > ----------------------------------
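
To make the offset problem discussed in this thread concrete, here is a minimal sketch in Python of the bookkeeping a concatenating merger would need. It is only an illustration under stated assumptions: `ChunkMeta`, `SourceFile` and `relocate` are hypothetical names (not part of any Parquet library), the offset field is loosely modeled on the absolute offsets mentioned above, and the sketch ignores statistics, ordinals, column indexes, offset indexes and bloom filters, as well as the actual Thrift encoding of the footer.

```python
from dataclasses import dataclass, replace
from typing import List

MAGIC_LEN = 4  # every Parquet file starts with the 4-byte magic "PAR1"


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical stand-in for the offset fields of one column chunk."""
    data_page_offset: int       # absolute offset from the start of the file
    total_compressed_size: int  # size of the chunk in bytes


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical summary of one input file's layout."""
    name: str
    data_end: int            # absolute offset where the row group bytes end
    chunks: List[ChunkMeta]  # one entry per column chunk, in file order


def relocate(files: List[SourceFile]) -> List[ChunkMeta]:
    """Return chunk metadata whose offsets are valid in the combined stream.

    Assumes the combined stream is: one magic, then the row group bytes of
    each input file back to back, then a newly built footer.
    """
    merged = []
    cursor = MAGIC_LEN  # the combined stream writes the magic only once
    for f in files:
        shift = cursor - MAGIC_LEN  # how far this file's row group bytes move
        for c in f.chunks:
            merged.append(replace(c, data_page_offset=c.data_page_offset + shift))
        cursor += f.data_end - MAGIC_LEN  # skip past this file's row group bytes
    return merged


# File A holds 100 bytes of row group data, file B holds 80 bytes.
a = SourceFile("A", data_end=104, chunks=[ChunkMeta(4, 60), ChunkMeta(64, 40)])
b = SourceFile("B", data_end=84, chunks=[ChunkMeta(4, 80)])
print([c.data_page_offset for c in relocate([a, b])])  # [4, 64, 104]
```

The chunks of file A keep their offsets, while the chunk of file B moves from offset 4 to offset 104 because 100 bytes of file A's row group data now precede it. Every other absolute pointer in the metadata would need the same kind of adjustment before a new footer could be written.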
