Thanks for the suggestion, will do. Since such a high-level API is not yet implemented in the parquet-cpp project, I have to fall back to the newly introduced low-level API that tracks the Parquet file size as data is added to the column writers. I have another question on that part:
Is there any sample code or advice I can follow to stream a Parquet file on a
per-row-group basis? In other words, to limit memory usage while still
producing a large enough Parquet file, I would like to build a relatively
small row group in memory using InMemoryOutputStream(), dump the buffer
contents to my external stream after completing each row group, and repeat
until a big file with several row groups is finished. However, my attempts to
manipulate the underlying arrow::Buffer have failed: the pages starting from
the second row group come out unreadable. (A rough sketch of the loop I have
in mind is appended below the quoted thread.)

Thanks!

On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney <[email protected]> wrote:
> hi Jiayuan,
>
> To your question
>
> > Would this be in the roadmap?
>
> I doubt there would be any objections to adding this feature to the
> Arrow writer API -- please feel free to open a JIRA issue to describe
> how the API might work in C++. Note there is no formal roadmap in this
> project.
>
> - Wes
>
> On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen <[email protected]> wrote:
> >
> > Thanks for the Python solution. However, is there a way in C++ to create
> > such a Parquet file with only an in-memory buffer, using the parquet-cpp
> > library?
> >
> > On Mon, Dec 10, 2018 at 3:23 PM Lee, David <[email protected]> wrote:
> >
> > > Resending.. Somehow I lost some line feeds in the previous reply..
> > >
> > > import os
> > > import glob
> > >
> > > import pyarrow.parquet as pq
> > >
> > > max_target_size = 134217728
> > > target_size = max_target_size * .95
> > > # Directory where parquet files are saved
> > > working_directory = '/tmp/test'
> > > files_dict = dict()
> > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > files.sort()
> > > for file in files:
> > >     files_dict[file] = os.path.getsize(file)
> > > print("Merging parquet files")
> > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > file_no = 0
> > > for file in files:
> > >     if file in files_dict:
> > >         file_no = file_no + 1
> > >         file_name = os.path.join(working_directory,
> > >                                  str(file_no).zfill(4) + ".parquet")
> > >         print("Saving to parquet file " + file_name)
> > >         # Just rename the file if its size is already in the target range
> > >         if files_dict[file] > target_size:
> > >             del files_dict[file]
> > >             os.rename(file, file_name)
> > >             continue
> > >         merge_list = list()
> > >         file_size = 0
> > >         # Find files to merge together which add up to less than 128 megs
> > >         for k, v in files_dict.items():
> > >             if file_size + v <= max_target_size:
> > >                 print("Adding file " + k + " to merge list")
> > >                 merge_list.append(k)
> > >                 file_size = file_size + v
> > >         # Just rename the file if there is only one file to merge
> > >         if len(merge_list) == 1:
> > >             del files_dict[merge_list[0]]
> > >             os.rename(merge_list[0], file_name)
> > >             continue
> > >         # Merge smaller files into one large file. Read row groups from
> > >         # each file and add them to the new file.
> > >         schema = pq.read_schema(file)
> > >         print("Saving to new parquet file")
> > >         writer = pq.ParquetWriter(temp_file, schema=schema,
> > >                                   use_dictionary=True,
> > >                                   compression='snappy')
> > >         for merge in merge_list:
> > >             parquet_file = pq.ParquetFile(merge)
> > >             print("Writing " + merge + " to new parquet file")
> > >             for i in range(parquet_file.num_row_groups):
> > >                 writer.write_table(parquet_file.read_row_group(i))
> > >             del files_dict[merge]
> > >             os.remove(merge)
> > >         writer.close()
> > >         os.rename(temp_file, file_name)
> > >
> > > -----Original Message-----
> > > From: Jiayuan Chen <[email protected]>
> > > Sent: Monday, December 10, 2018 2:30 PM
> > > To: [email protected]
> > > Subject: parquet-arrow estimate file size
> > >
> > > Hello,
> > >
> > > I am a Parquet developer in the Bay Area, and I am writing this email
> > > to seek help on writing Parquet files from Arrow.
> > >
> > > My goal is to control the size (in bytes) of the output Parquet file
> > > when writing from an existing Arrow table. I saw a reply from 2017 on
> > > this StackOverflow post (
> > > https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
> > > ) and am wondering if the following implementation is currently
> > > possible: feed data into the Arrow table until the buffered data can
> > > be converted to a Parquet file of a target size (e.g. 256 MB, instead
> > > of a fixed number of rows), and then use WriteTable() to create such
> > > a Parquet file.
> > >
> > > I saw that parquet-cpp recently introduced an API to control the
> > > column writer's size in bytes in the low-level API, but it seems this
> > > is not yet available for the arrow-parquet API. Would this be in the
> > > roadmap?
> > >
> > > Thanks,
> > > Jiayuan
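P.S. To make the question concrete, below is roughly the shape of the writer loop I have in mind, adapted from the low-level reader/writer example that ships with parquet-cpp. It is an untested sketch, not working code: the single required INT64 column, the row-group sizes, and the arrow::io::FileOutputStream sink are placeholders (in my case the sink would be an OutputStream implementation that forwards writes to my external stream), and the exact sink and writer signatures differ between parquet-cpp/Arrow versions. The idea is that every row group is written through one continuous sink, so the offsets recorded in the file footer stay valid, which I suspect is where my copy-the-buffer-per-row-group attempt goes wrong.

#include <cstdint>
#include <memory>

#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::PrimitiveNode;

int main() {
  // Placeholder sink: a local file. In the real use case this would be a
  // custom arrow::io::OutputStream that forwards Write() to the external
  // stream.
  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink,
                          arrow::io::FileOutputStream::Open("example.parquet"));

  // Placeholder schema: one required INT64 column.
  parquet::schema::NodeVector fields;
  fields.push_back(
      PrimitiveNode::Make("value", Repetition::REQUIRED, Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", Repetition::REQUIRED, fields));

  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(sink, schema);

  constexpr int kNumRowGroups = 4;         // placeholder
  constexpr int64_t kRowsPerGroup = 1000;  // placeholder
  for (int rg = 0; rg < kNumRowGroups; ++rg) {
    parquet::RowGroupWriter* rg_writer = writer->AppendRowGroup();
    auto* col = static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
    for (int64_t i = 0; i < kRowsPerGroup; ++i) {
      int64_t value = rg * kRowsPerGroup + i;
      // Required column: no definition or repetition levels needed.
      col->WriteBatch(1, nullptr, nullptr, &value);
    }
    // Closing the row group flushes its pages to the sink, so memory use
    // stays bounded by roughly one row group at a time.
    rg_writer->Close();
  }
  // Writes the footer; the row-group offsets it records are valid because
  // all row groups went through the same continuous sink.
  writer->Close();
  return 0;
}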
