Thanks for the Python solution. However, is there a C++ solution that can create such a Parquet file using only an in-memory buffer, with the parquet-cpp library?
On Mon, Dec 10, 2018 at 3:23 PM Lee, David <[email protected]> wrote:

> Resending.. Somehow I lost some line feeds in the previous reply..
>
> import os
> import pyarrow.parquet as pq
> import glob as glob
>
> max_target_size = 134217728
> target_size = max_target_size * .95
> # Directory where parquet files are saved
> working_directory = '/tmp/test'
> files_dict = dict()
> files = glob.glob(os.path.join(working_directory, "*.parquet"))
> files.sort()
> for file in files:
>     files_dict[file] = os.path.getsize(file)
> print("Merging parquet files")
> temp_file = os.path.join(working_directory, "temp.parquet")
> file_no = 0
> for file in files:
>     if file in files_dict:
>         file_no = file_no + 1
>         file_name = os.path.join(working_directory,
>                                  str(file_no).zfill(4) + ".parquet")
>         print("Saving to parquet file " + file_name)
>         # Just rename the file if its size is already in the target range
>         if files_dict[file] > target_size:
>             del files_dict[file]
>             os.rename(file, file_name)
>             continue
>         merge_list = list()
>         file_size = 0
>         # Find files to merge together which add up to less than 128 megs
>         for k, v in files_dict.items():
>             if file_size + v <= max_target_size:
>                 print("Adding file " + k + " to merge list")
>                 merge_list.append(k)
>                 file_size = file_size + v
>         # Just rename the file if there is only one file to merge
>         if len(merge_list) == 1:
>             del files_dict[merge_list[0]]
>             os.rename(merge_list[0], file_name)
>             continue
>         # Merge smaller files into one large file. Read row groups from
>         # each file and add them to the new file.
>         schema = pq.read_schema(file)
>         print("Saving to new parquet file")
>         writer = pq.ParquetWriter(temp_file, schema=schema,
>                                   use_dictionary=True, compression='snappy')
>         for merge in merge_list:
>             parquet_file = pq.ParquetFile(merge)
>             print("Writing " + merge + " to new parquet file")
>             for i in range(parquet_file.num_row_groups):
>                 writer.write_table(parquet_file.read_row_group(i))
>             del files_dict[merge]
>             os.remove(merge)
>         writer.close()
>         os.rename(temp_file, file_name)
>
>
> -----Original Message-----
> From: Jiayuan Chen <[email protected]>
> Sent: Monday, December 10, 2018 2:30 PM
> To: [email protected]
> Subject: parquet-arrow estimate file size
>
> Hello,
>
> I am a Parquet developer in the Bay Area, and I am writing this email to
> seek help on writing Parquet files from Arrow.
>
> My goal is to control the size (in bytes) of the output Parquet file when
> writing from an existing Arrow table. I saw a 2017 reply on this
> StackOverflow post (
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_45572962_how-2Dcan-2Di-2Dwrite-2Dstreaming-2Drow-2Doriented-2Ddata-2Dusing-2Dparquet-2Dcpp-2Dwithout-2Dbuffering&d=DwIBaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=Xc94mwZKuRfKH1rBeBcZvo7wtImfqsvAjDalN4JxsOA&s=209MSzgWa7GsPhLJgGsYhcHCoTC59R4ksjIOYqklNPs&e=
> ) and I am wondering whether the following implementation is currently
> possible: feed data into the Arrow table until the buffered data would
> convert to a Parquet file of a target size (e.g. 256 MB, instead of a
> fixed number of rows), and then use WriteTable() to create that Parquet
> file.
>
> I saw that parquet-cpp recently introduced an API to control the column
> writer's size in bytes in the low-level API, but this does not seem to be
> available yet in the arrow-parquet API. Would this be on the roadmap?
>
> Thanks,
> Jiayuan
