In my experience and experiments it is really hard to approximate target sizes. A single parquet file with a single row group can be 20% larger than a parquet file with 20 row groups, because with a lot of rows and a lot of data variety you can lose dictionary encoding options. I predetermine my row group sizes by creating them as individual files and then writing them to a single parquet file.

A better approach would probably be to write each row group to a single file and, once the size exceeds your target size, remove the last row group written and start a new file with it, but I don't think there is a method to remove a row group right now. Another option would be to write the row group out as a file object in memory to predetermine its size before adding it as a row group in a parquet file.
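To make that in-memory option concrete, here is a minimal pyarrow sketch (the helper names and the 128 MB target are placeholders, not an established API), assuming the data arrives as a sequence of RecordBatches:

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_SIZE = 128 * 1024 * 1024  # placeholder target, roughly 128 MB

def measured_parquet_size(table):
    # Serialize the chunk to an in-memory parquet file purely to learn
    # its encoded size before committing it anywhere.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    return sink.getvalue().size

def write_target_sized_files(batches, path_for):
    # path_for(n) -> file path for the n-th output file (caller supplied).
    writer, file_no, written = None, 0, 0
    for batch in batches:
        table = pa.Table.from_batches([batch])
        size = measured_parquet_size(table)
        # Roll over to a new file if this chunk would push us past the target.
        if writer is not None and written + size > TARGET_SIZE:
            writer.close()
            writer, written = None, 0
        if writer is None:
            file_no += 1
            writer = pq.ParquetWriter(path_for(file_no), table.schema)
        writer.write_table(table)  # each call appends (roughly) one row group
        written += size
    if writer is not None:
        writer.close()

The per-chunk estimate is only approximate: as noted above, dictionary encoding and page layout can change once chunks share a file, so the finished files can land somewhat above or below the target.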
-----Original Message-----
From: Wes McKinney <wesmck...@gmail.com>
Sent: Tuesday, December 11, 2018 7:16 AM
To: Parquet Dev <dev@parquet.apache.org>
Subject: Re: parquet-arrow estimate file size

hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to
control or examine the size of files as they are being written. Ideally we
would develop an interface to write a sequence of arrow::RecordBatch objects
that would automatically move on to a new file once a certain approximate
target size has been reached in an existing file. There's a number of moving
parts that would need to be created to make this possible.

- Wes

On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> I think if I've understood the problem correctly, you could use the
> parquet::arrow::FileWriter
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
>
> The basic pattern is to use an object to manage the FileWriter lifetime,
> call the WriteTable method for each row group, and close it when you are
> done. My understanding is that each call to WriteTable will append a new
> row group, which should allow you to incrementally write an out-of-memory
> dataset. I realize now that I haven't tested this myself, so it would be
> good to double-check this with someone more experienced with the
> parquet-cpp APIs.
>
> On 12/11/18, 12:54 AM, "Jiayuan Chen" <hamt...@gmail.com> wrote:
>
> Thanks for the suggestion, will do.
>
> Since such a high-level API is not yet implemented in the parquet-cpp
> project, I have to turn back to the API newly introduced in the low-level
> API, which calculates the Parquet file size when adding data into the
> column writers. I have another question on that part:
>
> Is there any sample code or advice I can follow to stream the Parquet file
> on a per-row-group basis? In other words, to restrict memory usage but
> still create a big enough Parquet file, I would like to create relatively
> small row groups in memory using InMemoryOutputStream(), and dump the
> buffer contents to my external stream after completing each row group,
> until a big file with several row groups is finished. However, my attempts
> to manipulate the underlying arrow::Buffer have failed: the pages starting
> from the second row group are unreadable.
>
> Thanks!
>
> On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Jiayuan,
> >
> > To your question
> >
> > > Would this be in the roadmap?
> >
> > I doubt there would be any objections to adding this feature to the
> > Arrow writer API -- please feel free to open a JIRA issue to describe
> > how the API might work in C++. Note there is no formal roadmap in this
> > project.
> >
> > - Wes
> >
> > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen <hamt...@gmail.com> wrote:
> > >
> > > Thanks for the Python solution. However, is there a solution in C++
> > > that I can create such Parquet file with only an in-memory buffer,
> > > using the parquet-cpp library?
> > >
> > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David <david....@blackrock.com> wrote:
> > >
> > > > Resending.. Somehow I lost some line feeds in the previous reply..
> > > >
> > > > import os
> > > > import pyarrow.parquet as pq
> > > > import glob as glob
> > > >
> > > > max_target_size = 134217728
> > > > target_size = max_target_size * .95
> > > > # Directory where parquet files are saved
> > > > working_directory = '/tmp/test'
> > > > files_dict = dict()
> > > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > > files.sort()
> > > > for file in files:
> > > >     files_dict[file] = os.path.getsize(file)
> > > > print("Merging parquet files")
> > > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > > file_no = 0
> > > > for file in files:
> > > >     if file in files_dict:
> > > >         file_no = file_no + 1
> > > >         file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
> > > >         print("Saving to parquet file " + file_name)
> > > >         # Just rename the file if its size is already in the target range
> > > >         if files_dict[file] > target_size:
> > > >             del files_dict[file]
> > > >             os.rename(file, file_name)
> > > >             continue
> > > >         merge_list = list()
> > > >         file_size = 0
> > > >         # Find files to merge together which add up to less than 128 megs
> > > >         for k, v in files_dict.items():
> > > >             if file_size + v <= max_target_size:
> > > >                 print("Adding file " + k + " to merge list")
> > > >                 merge_list.append(k)
> > > >                 file_size = file_size + v
> > > >         # Just rename the file if there is only one file to merge
> > > >         if len(merge_list) == 1:
> > > >             del files_dict[merge_list[0]]
> > > >             os.rename(merge_list[0], file_name)
> > > >             continue
> > > >         # Merge smaller files into one large file. Read row groups from
> > > >         # each file and add them to the new file.
> > > >         schema = pq.read_schema(file)
> > > >         print("Saving to new parquet file")
> > > >         writer = pq.ParquetWriter(temp_file, schema=schema, use_dictionary=True, compression='snappy')
> > > >         for merge in merge_list:
> > > >             parquet_file = pq.ParquetFile(merge)
> > > >             print("Writing " + merge + " to new parquet file")
> > > >             for i in range(parquet_file.num_row_groups):
> > > >                 writer.write_table(parquet_file.read_row_group(i))
> > > >             del files_dict[merge]
> > > >             os.remove(merge)
> > > >         writer.close()
> > > >         os.rename(temp_file, file_name)
> > > >
> > > > -----Original Message-----
> > > > From: Jiayuan Chen <hamt...@gmail.com>
> > > > Sent: Monday, December 10, 2018 2:30 PM
> > > > To: dev@parquet.apache.org
> > > > Subject: parquet-arrow estimate file size
> > > >
> > > > Hello,
> > > >
> > > > I am a Parquet developer in the Bay Area, and I am writing this email
> > > > to seek precious help on writing Parquet files from Arrow.
> > > >
> > > > My goal is to control the size (in bytes) of the output Parquet file
> > > > when writing from an existing arrow table.
> > > > I saw a reply in 2017 on this StackOverflow post (
> > > > https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
> > > > ) and am wondering if the following implementation is currently
> > > > possible: feed data into the Arrow table until the buffered data can
> > > > be converted to a Parquet file (e.g. of size 256 MB, instead of a
> > > > fixed number of rows), and then use WriteTable() to create such a
> > > > Parquet file.
> > > >
> > > > I saw that parquet-cpp recently introduced an API to control the
> > > > column writer's size in bytes in the low-level API, but it seems this
> > > > is still not yet available for the arrow-parquet API. Would this be
> > > > in the roadmap?
> > > >
> > > > Thanks,
> > > > Jiayuan