RE: Concatenation of parquet files
Here was my solution back in 2018. It's easier to do now with pyarrow's Python APIs than with Spark:
https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file/51216145#51216145

Read the smaller files in your list one at a time and write each one to the temp file as a Parquet ROW GROUP. It is very important to write each file in as a row group: that preserves the compression encoding and guarantees the number of bytes written (minus schema metadata) will match the original file size.

-----Original Message-----
From: Lee, David
Sent: Friday, October 15, 2021 2:04 PM
To: dev@parquet.apache.org; 'emkornfi...@gmail.com'; david@blackrock.com.invalid
Subject: RE: Concatenation of parquet files

Well, this is right and wrong. There is one footer, but the statistics are captured per row group, which allows row groups to be easily concatenated into a new file without rebuilding column stats. The final file looks more like:

    ROW GROUP A1
    ROW GROUP A2
    ROW GROUP B1
    ROW GROUP B2
    FOOTER A1, A2, B1, B2

http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/

"When all the row groups are written, and before closing the file, the Parquet writer adds the footer to the end of the file. The footer includes the file schema (column names and their types) as well as details about every row group (total size, number of rows, min/max statistics, number of NULL values for every column). Note that these column statistics are per row group, not for the entire file."

-----Original Message-----
From: Micah Kornfield
Sent: Friday, October 15, 2021 1:40 PM
To: david@blackrock.com.invalid
Cc: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files

External Email: Use caution with links and attachments

Hi David,

I'm not sure I understand. Concatenating files like this would likely break things.
In particular in the example:

> Merged:
> ROW GROUP A1
> FOOTER A1
> ROW GROUP A2
> FOOTER A2
> ROW GROUP B1
> FOOTER B1
> ROW GROUP B2
> FOOTER B2

There should only be one footer per file; otherwise I don't think there is any means of discovering the A row groups. Also, without rewriting metadata, the file offsets of B would be wrong (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790).

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

> "We can similarly write a Parquet file with multiple row groups by using ParquetWriter"

Multiple row groups are fine. Combining them after the fact by simple file concatenation (which is what I understand the original question to be) would yield incorrect results. If you reread small files and write them out again in one pass, that would be fine.

Cheers,
Micah

On Fri, Oct 15, 2021 at 1:29 PM Lee, David wrote:

> Each row group should have its own statistics footer or dictionary.
> Your file structure should look like this:
>
> contents of parquet file A:
> ROW GROUP A1
> FOOTER A1
> ROW GROUP A2
> FOOTER A2
>
> contents of parquet file B:
> ROW GROUP B1
> FOOTER B1
> ROW GROUP B2
> FOOTER B2
>
> Merged:
> ROW GROUP A1
> FOOTER A1
> ROW GROUP A2
> FOOTER A2
> ROW GROUP B1
> FOOTER B1
> ROW GROUP B2
> FOOTER B2
>
> I frequently concatenate smaller parquet files by appending row groups
> until I hit an optimal 125 meg file size for HDFS.
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by using ParquetWriter"
>
> -----Original Message-----
> From: Pau Tallada
> Sent: Tuesday, September 14, 2021 6:01 AM
> To: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> Dear Gabor,
>
> Thanks a lot for the clarification! ☺
> I understand this is not a common use case; I somewhat just had hope it could be done easily :P
>
> If you are interested, I attach a Colab notebook where it shows this behaviour.
RE: Concatenation of parquet files
Each row group should have its own statistics footer or dictionary. Your file structure should look like this:

contents of parquet file A:
    ROW GROUP A1
    FOOTER A1
    ROW GROUP A2
    FOOTER A2

contents of parquet file B:
    ROW GROUP B1
    FOOTER B1
    ROW GROUP B2
    FOOTER B2

Merged:
    ROW GROUP A1
    FOOTER A1
    ROW GROUP A2
    FOOTER A2
    ROW GROUP B1
    FOOTER B1
    ROW GROUP B2
    FOOTER B2

I frequently concatenate smaller parquet files by appending row groups until I hit an optimal 125 meg file size for HDFS.

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
"We can similarly write a Parquet file with multiple row groups by using ParquetWriter"

-----Original Message-----
From: Pau Tallada
Sent: Tuesday, September 14, 2021 6:01 AM
To: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files

Dear Gabor,

Thanks a lot for the clarification! ☺
I understand this is not a common use case; I somewhat just had hope it could be done easily :P

If you are interested, I attach a Colab notebook where it shows this behaviour. The same data written three times produces different binary contents.

https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing

Thanks again and best regards,

Pau

Message from Gabor Szadovszky on Tue, 14 Sep 2021 at 10:54:

> Hi Pau,
>
> I guess attachments are not allowed on the Apache lists, so we cannot see the image.
>
> If the two row groups contain the very same data in the same order, encoded with the same encoding and compressed with the same codec, I think they should be the same binary. I am not sure why you have different binary streams for these row groups, but if the proper data can be decoded from both row groups I would not spend too much time on it.
> About merging row groups: it is a tough issue, and not nearly as simple as concatenating the row groups (files) and creating a new footer. There are statistics in the footer that you have to take care of, as well as column indexes and bloom filters that are part of neither the footer nor the row groups. (They are written in separate data structures before the footer.)
> If you don't want to decode the row groups, these statistics can be updated (with the new offsets), and the new footer can be created by reading the original footers only. The problem here is that creating such a parquet file is not very useful in most cases. Most of the problems come from many small row groups (in small files), which cannot be solved this way. To solve the small-files problem we need to merge the row groups, and for that we need to decode the original data so we can re-create the statistics (at least for bloom filters).
>
> Long story short: theoretically it is solvable, but it is a feature we haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that provides interactive analysis and exploration of large scientific datasets. Working with Hive, users are able to generate the subset of data they are interested in, and this result set is stored as a set of files. When users want to download this dataset, we combine/concatenate all the files on-the-fly to generate a single stream that gets downloaded. Done right, this is very efficient: it avoids materializing the combined file, and the stream is even seekable, so downloads can be resumed. We are able to do this for csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet.
> > Looking at the format specification, it seems that it could be done by simply concatenating the binary blobs of the set of row groups and generating a new footer for the merged file. The problem is that the same data, written twice in the same file (in two row groups), is represented with some differences in the binary stream produced (see attached image). Why is the binary representation of a row group different if the data is the same? Is the order or position of a row group codified inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single integer column named 'c' with a single value 0) written twice, with at least two differences marked in red and blue.
> > [image: image.png]
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > contents of parquet file A:
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > contents of parquet file B:
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row
RE: Parquet File Naming Convention Standards
I've tried the one-row-group-per-parquet-file/block approach and ran into a couple of problems. Some observations:

1. A single row group would contain 30 million rows x 10 columns of data, which requires a lot more memory to write the file. Saving 10 row groups one at a time into a single parquet file cuts the max memory usage down to 3 million rows.

2. Dictionary encoding only works if the dictionary values do not exceed the reserved space in a parquet file. Each row group has its own reserved space for dictionary values. Once you exceed the reserved space, dictionary encoding isn't used, which can lead to slower query performance and increase the overall storage needed by 10% or more.

3. I generally try to store 30 million cells of data per row group: 3 million rows x 10 columns, or 10 million rows x 3 columns, etc.

-----Original Message-----
From: Tim Armstrong
Sent: Wednesday, May 22, 2019 12:27 PM
To: Parquet Dev
Subject: Re: Parquet File Naming Convention Standards

Not reusing file names is generally a good idea - there are a bunch of interesting consistency issues, particularly on object stores, if you reuse file paths. This has come up for us with things like INSERT OVERWRITE in Hive, which tends to generate the same file names.

I think there's an interesting set of discussions to be had around best practices for file sizes and row group sizes. One point is that a lot of big data frameworks schedule parallel work based on filesystem metadata only (i.e. file sizes and block sizes, if the filesystem has a concept of a block). If you have arbitrary parquet files this can break down in various ways - e.g. if you have a 1GB file, you have to guess what a good way to divide up the processing is. If there are fewer row groups than expected you'll get skew, and if there are more you'll lose out on parallelism.
HDFS blocks were often a good way to do this, since a lot of writers aim for one row group per block, but Parquet files often come from a variety of sources and get munged in different ways, so the heuristic falls over in some applications. It's somewhat worse on object stores like S3, where there isn't a concept of a block - just whatever the writer and reader have configured. You really want reader and writer block sizes to line up, but coordinating can be difficult for some workflows.

Working on Impala, I'm a bit biased towards larger blocks, because of the scheduling problems and also because of the extra overhead added with row groups - we end up needing to do extra I/O operations per row group, adding overhead (some of the overhead is inherent because the data you're reading is more fragmented; some of it is just our implementation).

On Wed, May 22, 2019 at 11:55 AM Brian Bowman wrote:

> Thanks for the info!
>
> HDFS is only one of many storage platforms (distributed or otherwise) that SAS supports. In general, larger physical files (e.g. 100MB to 1GB) with multiple row groups are also a good thing for our usage cases. I'm working to get our Parquet (C to C++ via libparquet.so) writer to do this.
>
> -Brian
>
> On 5/22/19, 1:21 PM, "Lee, David" wrote:
>
> I'm not a big fan of this convention, which is a Spark convention.
>
> A. The files should have at least "foo" in the name. Using PyArrow I would create these files as foo.1.parquet, foo.2.parquet, etc.
> B. These files are around 3 megs each. For HDFS storage, files should be sized to match the HDFS blocksize, which is usually set at 128 megs (default) or 256 megs, 512 megs, 1 gig, etc.
> https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
> I usually take small parquet files and save them as parquet row groups in a larger parquet file to match the HDFS blocksize.
>
> -----Original Message-----
> From: Brian Bowman
> Sent: Wednesday, May 22, 2019 8:40 AM
> To: dev@parquet.apache.org
> Subject: Parquet File Naming Convention Standards
>
> All,
>
> Here is an example .parquet data set saved using pySpark, where the following files are members of directory "foo.parquet":
>
> -rw-r--r--  1 sasbpb  r        8 Mar 26 12:10 ._SUCCESS.crc
> -rw-r--r--  1 sasbpb  r    25632 Mar 26 12:10 .part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
> -rw-r--r--  1 sasbpb  r    25356 Mar 26 12:10 .part-1-b84abe50-a92b-4b2b-b011-3099
RE: Parquet File Naming Convention Standards
I'm not a big fan of this convention, which is a Spark convention.

A. The files should have at least "foo" in the name. Using PyArrow I would create these files as foo.1.parquet, foo.2.parquet, etc.

B. These files are around 3 megs each. For HDFS storage, files should be sized to match the HDFS blocksize, which is usually set at 128 megs (default) or 256 megs, 512 megs, 1 gig, etc.

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

I usually take small parquet files and save them as parquet row groups in a larger parquet file to match the HDFS blocksize.

-----Original Message-----
From: Brian Bowman
Sent: Wednesday, May 22, 2019 8:40 AM
To: dev@parquet.apache.org
Subject: Parquet File Naming Convention Standards

All,

Here is an example .parquet data set saved using pySpark, where the following files are members of directory "foo.parquet":

-rw-r--r--  1 sasbpb  r        8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--  1 sasbpb  r    25632 Mar 26 12:10 .part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--  1 sasbpb  r    25356 Mar 26 12:10 .part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--  1 sasbpb  r    26300 Mar 26 12:10 .part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--  1 sasbpb  r    23728 Mar 26 12:10 .part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--  1 sasbpb  r        0 Mar 26 12:10 _SUCCESS
-rw-r--r--  1 sasbpb  r  3279617 Mar 26 12:10 part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--  1 sasbpb  r  3244105 Mar 26 12:10 part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--  1 sasbpb  r  3365039 Mar 26 12:10 part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--  1 sasbpb  r  3035960 Mar 26 12:10 part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet

Questions:

1. Is this the "standard" for creating/saving a .parquet data set?
2. It appears that "b84abe50-a92b-4b2b-b011-30990891fb83" is a UUID.
Is the format part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an established convention? Is this documented somewhere?

3. Is there a C++ class to create the CRC?

Thanks,

Brian

This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock's Privacy Policy. For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations. © 2019 BlackRock, Inc. All rights reserved.
RE: parquet-arrow estimate file size
In my experience and experiments, it is really hard to approximate target sizes. A single parquet file with a single row group could be 20% larger than a parquet file with 20 row groups, because if you have a lot of rows with a lot of data variety you can lose dictionary encoding options.

I predetermine my row group sizes by creating them as files and then write them to a single parquet file. A better approach would probably be to write each row group to a single file and, once the size exceeds your target size, remove the last row group written and start a new file with it - but I don't think there is a method to remove a row group right now. Another option would be to write the row group out as a file object in memory to predetermine its size before adding it as a row group in a parquet file.

-----Original Message-----
From: Wes McKinney
Sent: Tuesday, December 11, 2018 7:16 AM
To: Parquet Dev
Subject: Re: parquet-arrow estimate file size

hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to control or examine the size of files as they are being written. Ideally we would develop an interface to write a sequence of arrow::RecordBatch objects that would automatically move on to a new file once a certain approximate target size has been reached in an existing file. There's a number of moving parts that would need to be created to make this possible.
- Wes

On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal wrote:
>
> I think if I've understood the problem correctly, you could use the parquet::arrow::FileWriter
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
>
> The basic pattern is to use an object to manage the FileWriter lifetime, call the WriteTable method for each row group, and close it when you are done. My understanding is that each call to WriteTable will append a new row group, which should allow you to incrementally write an out-of-memory dataset. I realize now that I haven't tested this myself, so it would be good to double-check this with someone more experienced with the parquet-cpp APIs.
>
> On 12/11/18, 12:54 AM, "Jiayuan Chen" wrote:
>
> > Thanks for the suggestion, will do.
> >
> > Since such a high-level API is not yet implemented in the parquet-cpp project, I have to turn back to the API newly introduced in the low-level API, which calculates the Parquet file size when adding data into the column writers. I have another question on that part:
> >
> > Is there any sample code & advice that I can follow to be able to stream the Parquet file on a per-row-group basis? In other words, to restrict memory usage but still create a big enough Parquet file, I would like to create relatively small row groups in memory using InMemoryOutputStream(), and dump the buffer contents to my external stream after completing each row group, until a big file with several row groups is finished. However, my attempts to manipulate the underlying arrow::Buffer have failed: the pages starting from the second row group are unreadable.
> >
> > Thanks!
> > On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney wrote:
> >
> > > hi Jiayuan,
> > >
> > > To your question
> > >
> > > > Would this be in the roadmap?
> > >
> > > I doubt there would be any objections to adding this feature to the Arrow writer API -- please feel free to open a JIRA issue to describe how the API might work in C++. Note there is no formal roadmap in this project.
> > >
> > > - Wes
> > >
> > > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen wrote:
> > > >
> > > > Thanks for the Python solution. However, is there a solution in C++ that I can create such a Parquet file with only an in-memory buffer, using the parquet-cpp library?
> > > >
> > > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David wrote:
> > > > >
> > > > > Resending.. Somehow I lost some line feeds in the previous reply..
> > > > >
> > > > > import os
> > > > > import pyarrow.parquet as pq
> > > > > import glob as glob
> > > > >
> > > > > max_target_size = 134217728
> > > > > target_size = max_target_size * .95
> > > > > # Directory where parquet files are sa
RE: parquet-arrow estimate file size
Resending.. Somehow I lost some line feeds in the previous reply..

import os
import pyarrow.parquet as pq
import glob as glob

max_target_size = 134217728
target_size = max_target_size * .95

# Directory where parquet files are saved
working_directory = '/tmp/test'

files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)

print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)
        # Just rename file if the file size is in target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue
        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 megs
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v
        # Just rename file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue
        # Merge smaller files into one large file. Read row groups from
        # each file and add them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema, use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)

-----Original Message-----
From: Jiayuan Chen
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size

Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to seek help on writing Parquet files from Arrow. My goal is to control the size (in bytes) of the output Parquet file when writing from an existing arrow table.

I saw a reply from 2017 on this StackOverflow post (https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering) and am wondering if the following implementation is currently possible: feed data into the Arrow table until the buffered data can be converted to a Parquet file (e.g. of size 256 MB, instead of a fixed number of rows), and then use WriteTable() to create such a Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's size in bytes in the low-level API, but it seems this is still not yet available for the arrow-parquet API. Would this be in the roadmap?

Thanks,
Jiayuan
RE: parquet-arrow estimate file size
Here's my comment and how I'm generating 128 meg parquet files. This takes into account file sizes after compression and dictionary encoding.

https://issues.apache.org/jira/browse/ARROW-3728?focusedCommentId=16703544&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16703544

It would be nice to have a merge() parquet file function that does something similar, to create parquet files which match HDFS block sizes.

-----Original Message-----
From: Jiayuan Chen
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size

Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to seek help on writing Parquet files from Arrow. My goal is to control the size (in bytes) of the output Parquet file when writing from an existing arrow table.

I saw a reply from 2017 on this StackOverflow post (https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering) and am wondering if the following implementation is currently possible: feed data into the Arrow table until the buffered data can be converted to a Parquet file (e.g. of size 256 MB, instead of a fixed number of rows), and then use WriteTable() to create such a Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's size in bytes in the low-level API, but it seems this is still not yet available for the arrow-parquet API. Would this be in the roadmap?

Thanks,
Jiayuan