Resending; somehow I lost some line feeds in the previous reply.

import glob
import os

import pyarrow.parquet as pq

max_target_size = 134217728  # 128 MiB
target_size = max_target_size * 0.95

# Directory where the parquet files are saved
working_directory = '/tmp/test'

# Map each parquet file to its size in bytes, in sorted order
files = sorted(glob.glob(os.path.join(working_directory, "*.parquet")))
files_dict = {file: os.path.getsize(file) for file in files}

print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file not in files_dict:
        continue  # already consumed by an earlier merge
    file_no += 1
    file_name = os.path.join(working_directory,
                             str(file_no).zfill(4) + ".parquet")
    print("Saving to parquet file " + file_name)
    # Just rename the file if its size is already in the target range
    if files_dict[file] > target_size:
        del files_dict[file]
        os.rename(file, file_name)
        continue
    # Find files to merge together which add up to less than 128 MiB
    merge_list = []
    file_size = 0
    for k, v in files_dict.items():
        if file_size + v <= max_target_size:
            print("Adding file " + k + " to merge list")
            merge_list.append(k)
            file_size += v
    # Just rename the file if there is only one file to merge
    if len(merge_list) == 1:
        del files_dict[merge_list[0]]
        os.rename(merge_list[0], file_name)
        continue
    # Merge the smaller files into one large file: read the row groups
    # from each file and append them to the new file.
    schema = pq.read_schema(file)
    print("Saving to new parquet file")
    with pq.ParquetWriter(temp_file, schema=schema,
                          use_dictionary=True,
                          compression='snappy') as writer:
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
    os.rename(temp_file, file_name)
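
As a quick sanity check after the merge, something like the following
sketch (assuming the same working_directory as above) reports the on-disk
size and row-group layout of each output file via pyarrow's metadata:

import glob
import os

import pyarrow.parquet as pq

working_directory = '/tmp/test'  # same directory as the merge script
for path in sorted(glob.glob(os.path.join(working_directory, "*.parquet"))):
    # Read only the file footer, not the data, to inspect the layout
    meta = pq.ParquetFile(path).metadata
    print(path, os.path.getsize(path), "bytes on disk,",
          meta.num_rows, "rows in", meta.num_row_groups, "row groups")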


-----Original Message-----
From: Jiayuan Chen <[email protected]> 
Sent: Monday, December 10, 2018 2:30 PM
To: [email protected]
Subject: parquet-arrow estimate file size

Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to ask 
for help with writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when 
writing from an existing Arrow table. I saw a reply from 2017 on this 
StackOverflow post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering)
and am wondering whether the following implementation is currently possible: 
feed data into an Arrow table until the buffered data can be converted to a 
Parquet file of a target size (e.g. 256 MB, instead of a fixed number of 
rows), and then use WriteTable() to create that Parquet file.
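
A rough version of that buffering loop can be sketched in Python with
pyarrow, using the in-memory size of the buffered record batches as a proxy
for the eventual file size. Note that incoming_batches() below is a
hypothetical source of pa.RecordBatch objects, and the compressed file on
disk will typically be smaller than the in-memory estimate:

import pyarrow as pa
import pyarrow.parquet as pq

target_bytes = 256 * 1024 * 1024  # rough target; output is usually smaller

def flush(batches, path):
    # Convert the buffered batches to a table and write one Parquet file
    table = pa.Table.from_batches(batches)
    pq.write_table(table, path)

buffered = []
buffered_size = 0
file_no = 0
for batch in incoming_batches():  # hypothetical batch source
    buffered.append(batch)
    buffered_size += batch.nbytes  # in-memory size, a proxy for file size
    if buffered_size >= target_bytes:
        file_no += 1
        flush(buffered, "part-%04d.parquet" % file_no)
        buffered, buffered_size = [], 0
if buffered:
    flush(buffered, "part-%04d.parquet" % (file_no + 1))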

I saw that parquet-cpp recently introduced an API to control the column 
writer's size in bytes in the low-level API, but it seems this is not yet 
available in the arrow-parquet API. Is this on the roadmap?
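
Until then, one approximate workaround at the Python level is to derive a
per-row-group row count from the table's in-memory size. This sketch assumes
rows have a fairly uniform width and that in-memory size is an acceptable
proxy for encoded size:

import pyarrow.parquet as pq

def write_with_byte_target(table, path,
                           target_row_group_bytes=128 * 1024 * 1024):
    # Estimate how many rows fit in the byte target from the in-memory
    # size; encoding and compression usually shrink the on-disk row groups.
    avg_row_bytes = max(1, table.nbytes // max(1, table.num_rows))
    rows_per_group = max(1, target_row_group_bytes // avg_row_bytes)
    pq.write_table(table, path, row_group_size=rows_per_group)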

Thanks,
Jiayuan

