Thanks for the Python solution. However, is there a C++ solution, using the
parquet-cpp library, that can create such a Parquet file with only an
in-memory buffer?
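
What I have in mind is roughly the following (only a sketch; I am basing the
calls on my reading of the Arrow C++ headers, and the exact signatures may
differ between Arrow/parquet-cpp versions):

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <parquet/arrow/writer.h>

// Write an arrow::Table to a growable in-memory buffer instead of a file.
arrow::Status WriteTableToBuffer(const std::shared_ptr<arrow::Table>& table,
                                 std::shared_ptr<arrow::Buffer>* out) {
  // BufferOutputStream is an OutputStream backed by a resizable in-memory
  // buffer, so nothing touches the filesystem.
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  // WriteTable() accepts any arrow::io::OutputStream as the sink.
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/64 * 1024));
  // Finish() returns the accumulated Parquet bytes as an arrow::Buffer.
  ARROW_ASSIGN_OR_RAISE(*out, sink->Finish());
  return arrow::Status::OK();
}

Would something along these lines work with the current parquet-cpp API?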

On Mon, Dec 10, 2018 at 3:23 PM Lee, David <[email protected]> wrote:

> Resending.. Somehow I lost some line feeds in the previous reply..
>
> import os
> import pyarrow.parquet as pq
> import glob
>
> max_target_size = 134217728
> target_size = max_target_size * .95
> # Directory where parquet files are saved
> working_directory = '/tmp/test'
> files_dict = dict()
> files = glob.glob(os.path.join(working_directory, "*.parquet"))
> files.sort()
> for file in files:
>     files_dict[file] = os.path.getsize(file)
> print("Merging parquet files")
> temp_file = os.path.join(working_directory, "temp.parquet")
> file_no = 0
> for file in files:
>     if file in files_dict:
>         file_no = file_no + 1
>         file_name = os.path.join(working_directory,
>                                  str(file_no).zfill(4) + ".parquet")
>         print("Saving to parquet file " + file_name)
>         # Just rename file if the file size is in target range
>         if files_dict[file] > target_size:
>             del files_dict[file]
>             os.rename(file, file_name)
>             continue
>         merge_list = list()
>         file_size = 0
>         # Find files to merge together which add up to less than 128 megs
>         for k, v in files_dict.items():
>             if file_size + v <= max_target_size:
>                 print("Adding file " + k + " to merge list")
>                 merge_list.append(k)
>                 file_size = file_size + v
>         # Just rename file if there is only one file to merge
>         if len(merge_list) == 1:
>             del files_dict[merge_list[0]]
>             os.rename(merge_list[0], file_name)
>             continue
>         # Merge smaller files into one large file. Read row groups from
>         # each file and add them to the new file.
>         schema = pq.read_schema(file)
>         print("Saving to new parquet file")
>         writer = pq.ParquetWriter(temp_file, schema=schema,
>                                   use_dictionary=True, compression='snappy')
>         for merge in merge_list:
>             parquet_file = pq.ParquetFile(merge)
>             print("Writing " + merge + " to new parquet file")
>             for i in range(parquet_file.num_row_groups):
>                 writer.write_table(parquet_file.read_row_group(i))
>             del files_dict[merge]
>             os.remove(merge)
>         writer.close()
>         os.rename(temp_file, file_name)
>
>
> -----Original Message-----
> From: Jiayuan Chen <[email protected]>
> Sent: Monday, December 10, 2018 2:30 PM
> To: [email protected]
> Subject: parquet-arrow estimate file size
>
> Hello,
>
> I am a Parquet developer in the Bay Area, and I am writing to ask for help
> with writing Parquet files from Arrow.
>
> My goal is to control the size (in bytes) of the output Parquet file when
> writing from an existing Arrow table. I saw a reply from 2017 on this
> StackOverflow post (
>
> https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
> )
> and am wondering whether the following implementation is currently possible:
> feed data into the Arrow table until the buffered data can be converted to a
> Parquet file of a target size (e.g. 256 MB, instead of a fixed number of
> rows), and then use WriteTable() to create such a Parquet file.
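
To make the size-threshold idea above concrete, this is roughly what I
picture, again only as a sketch against the Arrow C++ API: the names are my
assumptions, and the threshold is checked against the in-memory Arrow
footprint, which only approximates the final encoded Parquet file size.

#include <vector>
#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <parquet/arrow/writer.h>

// Accumulate record batches until their approximate in-memory size crosses a
// threshold, then serialize everything buffered so far into one Parquet file
// held in an in-memory buffer.
class BatchAccumulator {
 public:
  explicit BatchAccumulator(int64_t threshold_bytes)
      : threshold_bytes_(threshold_bytes) {}

  // Returns true once enough data is buffered to emit a Parquet file.
  bool Append(const std::shared_ptr<arrow::RecordBatch>& batch) {
    buffered_bytes_ += ApproxBatchSize(*batch);
    batches_.push_back(batch);
    return buffered_bytes_ >= threshold_bytes_;
  }

  // Convert the buffered batches to a table and write it with WriteTable().
  arrow::Result<std::shared_ptr<arrow::Buffer>> Flush() {
    ARROW_ASSIGN_OR_RAISE(auto table,
                          arrow::Table::FromRecordBatches(batches_));
    ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
    ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
        *table, arrow::default_memory_pool(), sink, /*chunk_size=*/64 * 1024));
    batches_.clear();
    buffered_bytes_ = 0;
    return sink->Finish();
  }

 private:
  // Rough size estimate: sum of each column's buffer sizes (nested children
  // are ignored, so this undercounts for nested types).
  static int64_t ApproxBatchSize(const arrow::RecordBatch& batch) {
    int64_t size = 0;
    for (const auto& column : batch.columns()) {
      for (const auto& buf : column->data()->buffers) {
        if (buf) size += buf->size();
      }
    }
    return size;
  }

  int64_t threshold_bytes_;
  int64_t buffered_bytes_ = 0;
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches_;
};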
>
> I saw that parquet-cpp recently introduced an API to control the column
> writer's size in bytes in the low-level API, but it seems this is not yet
> available in the arrow-parquet API. Is this on the roadmap?
>
> Thanks,
> Jiayuan
>
>
