I think, if I've understood the problem correctly, you could use
parquet::arrow::FileWriter:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128

The basic pattern is to use an object to manage the FileWriter's lifetime,
call the WriteTable method once per row group, and close the writer when you
are done.  My understanding is that each call to WriteTable appends a new row
group, which should allow you to incrementally write a dataset that does not
fit in memory.  I realize now that I haven't tested this myself, so it would
be good to double-check it with someone more experienced with the parquet-cpp
APIs.
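
Untested, but something along these lines is what I have in mind.  This is
only a sketch (WriteInRowGroups is just an illustrative helper, not an
existing API), and the exact Open/WriteTable signatures may differ a bit
between Arrow releases, so please double-check against the writer.h linked
above:

#include <memory>
#include <vector>

#include <arrow/io/interfaces.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>

// Illustrative helper (not an existing API): write each chunk of a
// larger-than-memory dataset as its own row group, so the full dataset
// never has to be materialized in memory at once.
arrow::Status WriteInRowGroups(
    const std::vector<std::shared_ptr<arrow::Table>>& chunks,
    const std::shared_ptr<arrow::Schema>& schema,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *schema, arrow::default_memory_pool(), sink,
      parquet::default_writer_properties(), &writer));
  for (const auto& chunk : chunks) {
    // Each WriteTable call appends a new row group; chunk_size is set to the
    // number of rows in the chunk so the chunk is not split further.
    ARROW_RETURN_NOT_OK(writer->WriteTable(*chunk, chunk->num_rows()));
  }
  return writer->Close();
}

If your data arrives in smaller record batches, I believe you can assemble
each chunk with arrow::Table::FromRecordBatches before calling WriteTable.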

On 12/11/18, 12:54 AM, "Jiayuan Chen" <hamt...@gmail.com> wrote:

    Thanks for the suggestion, will do.
    
    Since such a high-level API is not yet implemented in the parquet-cpp
    project, I have to fall back to the newly introduced low-level API that
    tracks the Parquet file size as data is added to the column writers. I
    have another question on that part:

    Is there any sample code or advice I can follow to stream the Parquet
    file on a per-row-group basis? In other words, to limit memory usage
    while still creating a big enough Parquet file, I would like to create
    relatively small row groups in memory using InMemoryOutputStream(), and
    dump the buffer contents to my external stream after completing each row
    group, until a big file with several row groups is finished. However, my
    attempts to manipulate the underlying arrow::Buffer have failed: the
    pages starting from the second row group are unreadable.
    
    Thanks!
    
    On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney <wesmck...@gmail.com> wrote:
    
    > hi Jiayuan,
    >
    > To your question
    >
    > > Would this be in the roadmap?
    >
    > I doubt there would be any objections to adding this feature to the
    > Arrow writer API -- please feel free to open a JIRA issue to describe
    > how the API might work in C++. Note there is no formal roadmap in this
    > project.
    >
    > - Wes
    > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen <hamt...@gmail.com> wrote:
    > >
    > > Thanks for the Python solution. However, is there a solution in C++
    > > that lets me create such a Parquet file with only an in-memory
    > > buffer, using the parquet-cpp library?
    > >
    > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David <david....@blackrock.com>
    > wrote:
    > >
    > > > Resending.. Somehow I lost some line feeds in the previous reply..
    > > >
    > > > import os
    > > > import pyarrow.parquet as pq
    > > > import glob as glob
    > > >
    > > > max_target_size = 134217728
    > > > target_size = max_target_size * .95
    > > > # Directory where parquet files are saved
    > > > working_directory = '/tmp/test'
    > > > files_dict = dict()
    > > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
    > > > files.sort()
    > > > for file in files:
    > > >     files_dict[file] = os.path.getsize(file)
    > > > print("Merging parquet files")
    > > > temp_file = os.path.join(working_directory, "temp.parquet")
    > > > file_no = 0
    > > > for file in files:
    > > >     if file in files_dict:
    > > >         file_no = file_no + 1
    > > >         file_name = os.path.join(working_directory,
    > > >             str(file_no).zfill(4) + ".parquet")
    > > >         print("Saving to parquet file " + file_name)
    > > >         # Just rename file if the file size is in target range
    > > >         if files_dict[file] > target_size:
    > > >             del files_dict[file]
    > > >             os.rename(file, file_name)
    > > >             continue
    > > >         merge_list = list()
    > > >         file_size = 0
    > > >         # Find files to merge together which add up to
    > > >         # less than 128 megs
    > > >         for k, v in files_dict.items():
    > > >             if file_size + v <= max_target_size:
    > > >                 print("Adding file " + k + " to merge list")
    > > >                 merge_list.append(k)
    > > >                 file_size = file_size + v
    > > >         # Just rename file if there is only one file to merge
    > > >         if len(merge_list) == 1:
    > > >             del files_dict[merge_list[0]]
    > > >             os.rename(merge_list[0], file_name)
    > > >             continue
    > > >         # Merge smaller files into one large file. Read row groups
    > > >         # from each file and add them to the new file.
    > > >         schema = pq.read_schema(file)
    > > >         print("Saving to new parquet file")
    > > >         writer = pq.ParquetWriter(temp_file, schema=schema,
    > > >             use_dictionary=True, compression='snappy')
    > > >         for merge in merge_list:
    > > >             parquet_file = pq.ParquetFile(merge)
    > > >             print("Writing " + merge + " to new parquet file")
    > > >             for i in range(parquet_file.num_row_groups):
    > > >                 writer.write_table(parquet_file.read_row_group(i))
    > > >             del files_dict[merge]
    > > >             os.remove(merge)
    > > >         writer.close()
    > > >         os.rename(temp_file, file_name)
    > > >
    > > >
    > > > -----Original Message-----
    > > > From: Jiayuan Chen <hamt...@gmail.com>
    > > > Sent: Monday, December 10, 2018 2:30 PM
    > > > To: dev@parquet.apache.org
    > > > Subject: parquet-arrow estimate file size
    > > >
    > > > Hello,
    > > >
    > > > I am a Parquet developer in the Bay Area, and I am writing this
    > > > email to ask for help with writing Parquet files from Arrow.
    > > >
    > > > My goal is to control the size (in bytes) of the output Parquet file
    > > > when writing from an existing Arrow table. I saw a reply from 2017 on
    > > > this StackOverflow post (
    > > > https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
    > > > ) and I am wondering whether the following implementation is
    > > > currently possible: feed data into the Arrow table until the buffered
    > > > data can be converted to a Parquet file (e.g. of size 256 MB, instead
    > > > of a fixed number of rows), and then use WriteTable() to create such
    > > > a Parquet file.
    > > >
    > > > I saw that parquet-cpp recently introduced an API to control the
    > > > column writer's size in bytes in the low-level API, but it seems this
    > > > is not yet available in the arrow-parquet API. Would this be in the
    > > > roadmap?
    > > >
    > > > Thanks,
    > > > Jiayuan
    > > >
    >
    
