Here's some sample code that merges small Parquet files into files of roughly 128 MB:
import glob
import os

import pyarrow.parquet as pq

# Target output size: 128 MiB per file, with a 5% margin.
max_target_size = 134217728
target_size = max_target_size * 0.95

# Directory where parquet files are saved
working_directory = '/tmp/test'

# Map each parquet file to its size in bytes
files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)

print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory,
                                 str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)

        # Just rename the file if its size is already in the target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue

        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 MB
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v

        # Just rename the file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue

        # Merge smaller files into one large file: read row groups from
        # each file and append them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema,
                                  use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)
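
For the buffered-write approach described in the message below (accumulate Arrow data until it would produce a Parquet file of roughly the desired size, then write it out), a rough sketch might look like the following. This is only a sketch: it assumes RecordBatch.nbytes (available in newer pyarrow versions) is a usable proxy for in-memory size, and since Parquet encoding and compression usually shrink the data, the compression_ratio value and the helper name write_batches_with_size_cap are placeholders of my own, not an existing API.

import pyarrow as pa
import pyarrow.parquet as pq

def write_batches_with_size_cap(batches, schema, path_template,
                                target_bytes=256 * 1024 * 1024,
                                compression_ratio=0.5):
    # Buffer record batches and start a new Parquet file whenever the
    # estimated on-disk size would exceed target_bytes.
    # compression_ratio is a rough guess at on-disk / in-memory size;
    # tune it for your data.
    buffered = []
    buffered_bytes = 0
    file_no = 0
    for batch in batches:
        estimated = (buffered_bytes + batch.nbytes) * compression_ratio
        if buffered and estimated > target_bytes:
            table = pa.Table.from_batches(buffered, schema=schema)
            pq.write_table(table, path_template.format(file_no))
            file_no += 1
            buffered = []
            buffered_bytes = 0
        buffered.append(batch)
        buffered_bytes += batch.nbytes
    if buffered:
        table = pa.Table.from_batches(buffered, schema=schema)
        pq.write_table(table, path_template.format(file_no))

Usage would be something like write_batches_with_size_cap(batches, schema, '/tmp/test/part-{:04d}.parquet'). The estimate is only as good as the compression ratio guess, which is why merging files after the fact (as above) is a simpler way to hit an exact size range.
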
-----Original Message-----
From: Jiayuan Chen <[email protected]>
Sent: Monday, December 10, 2018 2:30 PM
To: [email protected]
Subject: parquet-arrow estimate file size
Hello,
I am a Parquet developer in the Bay Area, and I am writing to ask for help with
writing Parquet files from Arrow.
My goal is to control the size (in bytes) of the output Parquet file when
writing from an existing Arrow table. I saw a 2017 reply on this StackOverflow
post
(https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering)
and am wondering whether the following implementation is currently possible:
feed data into the Arrow table until the buffered data can be converted to a
Parquet file of a target size (e.g. 256 MB, instead of a fixed number of rows),
and then use WriteTable() to create that Parquet file.
I saw that parquet-cpp recently introduced an API to control the column
writer's size in bytes in the low-level API, but it seems this is not yet
available in the arrow-parquet API. Is this on the roadmap?
Thanks,
Jiayuan