So it seems there is no way to implement such a mechanism using the
low-level API? I tried dumping the arrow::Buffer after each row group is
completed, but it is not that clear-cut: the pages starting from the
second row group become unreadable (the schema is correct, though).

If no such solution exists, I will go back to the high-level API that
uses an in-memory Arrow table.




On Tue, Dec 11, 2018 at 8:17 AM Lee, David <david....@blackrock.com> wrote:

> In my experience and experiments it is really hard to approximate target
> sizes. A single parquet file with a single row group can be 20% larger
> than a parquet file with 20 row groups, because with a lot of rows and a
> lot of data variety you can lose dictionary encoding options. I
> predetermine my row group sizes by creating them as separate files and
> then writing them to a single parquet file.
>
> A better approach would probably be to keep writing row groups to a single
> file and, once the size exceeds your target, remove the last row group
> written and start a new file with it, but I don't think there is a method
> to remove a row group right now.
>
> Another option would be to write the row group out as a file object in
> memory to predetermine its size before adding it as a row group in a
> parquet file.
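>
> For illustration, here is a minimal pyarrow sketch of that idea (the 128 MB
> target, the file name, and the single example chunk are assumptions made up
> for the sketch; the probe writes a standalone file, so footer overhead in the
> combined file will differ slightly):
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def encoded_size(table):
>     # Encode the chunk to an in-memory sink to learn its Parquet size,
>     # including compression and dictionary encoding.
>     sink = pa.BufferOutputStream()
>     pq.write_table(table, sink, compression='snappy')
>     return sink.getvalue().size
>
> target = 128 * 1024 * 1024
> chunk = pa.table({'x': list(range(100000))})
> writer = pq.ParquetWriter('combined.parquet', chunk.schema, compression='snappy')
> written = 0
> size = encoded_size(chunk)
> if written + size <= target:
>     writer.write_table(chunk)  # appends the chunk as a new row group
>     written += size            # rough running total of encoded bytes
> writer.close()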
>
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Tuesday, December 11, 2018 7:16 AM
> To: Parquet Dev <dev@parquet.apache.org>
> Subject: Re: parquet-arrow estimate file size
>
> hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to
> control or examine the size of files as they are being written.
> Ideally we would develop an interface to write a sequence of
> arrow::RecordBatch objects that would automatically move on to a new file
> once a certain approximate target size has been reached in an existing
> file. There are a number of moving parts that would need to be created to
> make this possible.
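>
> (For illustration, a rough pyarrow sketch of what such a rolling writer could
> look like; the class name, the file naming scheme, and the 256 MB default
> below are assumptions, not an existing API.)
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> class RollingParquetWriter:
>     # Writes record batches, starting a new file once the bytes already
>     # flushed to the current file pass an approximate target size.
>     def __init__(self, prefix, schema, target_bytes=256 * 1024 * 1024):
>         self.prefix, self.schema, self.target = prefix, schema, target_bytes
>         self.index, self.sink, self.writer = 0, None, None
>
>     def write_batch(self, batch):
>         if self.writer is None:
>             path = '%s-%04d.parquet' % (self.prefix, self.index)
>             self.sink = pa.OSFile(path, 'wb')
>             self.writer = pq.ParquetWriter(self.sink, self.schema)
>         self.writer.write_table(pa.Table.from_batches([batch]))  # one row group
>         if self.sink.tell() >= self.target:  # bytes flushed to this file so far
>             self.close()
>             self.index += 1
>
>     def close(self):
>         if self.writer is not None:
>             self.writer.close()
>             self.sink.close()
>             self.writer = None
>
> A caller would loop over incoming batches, call write_batch on each, and call
> close() once at the end.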
>
> - Wes
> On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal <hatem.he...@mathworks.co.uk>
> wrote:
> >
> > I think if I've understood the problem correctly, you could use the
> > parquet::arrow::FileWriter
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
> >
> > The basic pattern is to use an object to manage the FileWriter lifetime,
> > call the WriteTable method for each row group, and close it when you are
> > done.  My understanding is that each call to WriteTable will append a new
> > row group, which should allow you to incrementally write an out-of-memory
> > dataset.  I realize now that I haven't tested this myself, so it would be
> > good to double-check this with someone more experienced with the
> > parquet-cpp APIs.
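> >
> > For what it's worth, the analogous pattern in pyarrow looks roughly like the
> > sketch below (the file name and chunking are made up for illustration); each
> > write_table call appends at least one new row group:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > schema = pa.schema([('id', pa.int64())])
> > writer = pq.ParquetWriter('incremental.parquet', schema)
> > for start in range(0, 10000, 1000):
> >     # Convert and append one chunk at a time; each chunk becomes its own
> >     # row group, so the whole dataset never has to be in memory at once.
> >     chunk = pa.table({'id': list(range(start, start + 1000))}, schema=schema)
> >     writer.write_table(chunk)
> > writer.close()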
> >
> > On 12/11/18, 12:54 AM, "Jiayuan Chen" <hamt...@gmail.com> wrote:
> >
> >     Thanks for the suggestion, will do.
> >
> >     Since such a high-level API is not yet implemented in the parquet-cpp
> >     project, I have to turn back to using the API newly introduced in the
> >     low-level API, which calculates the Parquet file size as data is added
> >     into the column writers. I have another question on that part:
> >
> >     Is there any sample code or advice I can follow to stream the Parquet
> >     file on a per-row-group basis? In other words, to restrict memory usage
> >     but still create a big enough Parquet file, I would like to create
> >     relatively small row groups in memory using InMemoryOutputStream(), and
> >     dump the buffer contents to my external stream after completing each
> >     row group, until a big file with several row groups is finished.
> >     However, my attempt to manipulate the underlying arrow::Buffer has
> >     failed: the pages starting from the second row group are unreadable.
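> >
> >     (For illustration, a pyarrow sketch of one way around the buffer
> >     splicing: point the writer directly at the external sink, so each
> >     completed row group is flushed to it as it is written and only one
> >     footer is produced at close. The file name and chunk sizes below are
> >     made up, and the external stream is assumed to be a writable binary
> >     file-like object.)
> >
> >     import pyarrow as pa
> >     import pyarrow.parquet as pq
> >
> >     external = open('streamed.parquet', 'wb')   # stand-in for the real external stream
> >     schema = pa.schema([('x', pa.int64())])
> >     writer = pq.ParquetWriter(external, schema)
> >     for start in range(0, 3000, 1000):
> >         # Each call encodes one row group and writes it to the sink, so only
> >         # the current row group has to be buffered in memory.
> >         chunk = pa.table({'x': list(range(start, start + 1000))}, schema=schema)
> >         writer.write_table(chunk)
> >     writer.close()     # appends the single file footer
> >     external.close()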
> >
> >     Thanks!
> >
> >     On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >     > hi Jiayuan,
> >     >
> >     > To your question
> >     >
> >     > > Would this be in the roadmap?
> >     >
> >     > I doubt there would be any objections to adding this feature to the
> >     > Arrow writer API -- please feel free to open a JIRA issue to describe
> >     > how the API might work in C++. Note there is no formal roadmap in this
> >     > project.
> >     >
> >     > - Wes
> >     > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen <hamt...@gmail.com>
> wrote:
> >     > >
> >     > > Thanks for the Python solution. However, is there a way in C++ to
> >     > > create such a Parquet file using only an in-memory buffer, with the
> >     > > parquet-cpp library?
> >     > >
> >     > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David <
> david....@blackrock.com>
> >     > wrote:
> >     > >
> >     > > > Resending.. Somehow I lost some line feeds in the previous
> reply..
> >     > > >
> >     > > > import os
> >     > > > import pyarrow.parquet as pq
> >     > > > import glob as glob
> >     > > >
> >     > > > max_target_size = 134217728
> >     > > > target_size = max_target_size * .95
> >     > > > # Directory where parquet files are saved
> >     > > > working_directory = '/tmp/test'
> >     > > > files_dict = dict()
> >     > > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> >     > > > files.sort()
> >     > > > for file in files:
> >     > > >     files_dict[file] = os.path.getsize(file)
> >     > > > print("Merging parquet files")
> >     > > > temp_file = os.path.join(working_directory, "temp.parquet")
> >     > > > file_no = 0
> >     > > > for file in files:
> >     > > >     if file in files_dict:
> >     > > >         file_no = file_no + 1
> >     > > >         file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
> >     > > >         print("Saving to parquet file " + file_name)
> >     > > >         # Just rename file if the file size is in target range
> >     > > >         if files_dict[file] > target_size:
> >     > > >             del files_dict[file]
> >     > > >             os.rename(file, file_name)
> >     > > >             continue
> >     > > >         merge_list = list()
> >     > > >         file_size = 0
> >     > > >         # Find files to merge together which add up to less than 128 megs
> >     > > >         for k, v in files_dict.items():
> >     > > >             if file_size + v <= max_target_size:
> >     > > >                 print("Adding file " + k + " to merge list")
> >     > > >                 merge_list.append(k)
> >     > > >                 file_size = file_size + v
> >     > > >         # Just rename file if there is only one file to merge
> >     > > >         if len(merge_list) == 1:
> >     > > >             del files_dict[merge_list[0]]
> >     > > >             os.rename(merge_list[0], file_name)
> >     > > >             continue
> >     > > >         # Merge smaller files into one large file. Read row groups from
> >     > > >         # each file and add them to the new file.
> >     > > >         schema = pq.read_schema(file)
> >     > > >         print("Saving to new parquet file")
> >     > > >         writer = pq.ParquetWriter(temp_file, schema=schema,
> >     > > >                                   use_dictionary=True, compression='snappy')
> >     > > >         for merge in merge_list:
> >     > > >             parquet_file = pq.ParquetFile(merge)
> >     > > >             print("Writing " + merge + " to new parquet file")
> >     > > >             for i in range(parquet_file.num_row_groups):
> >     > > >                 writer.write_table(parquet_file.read_row_group(i))
> >     > > >             del files_dict[merge]
> >     > > >             os.remove(merge)
> >     > > >         writer.close()
> >     > > >         os.rename(temp_file, file_name)
> >     > > >
> >     > > >
> >     > > > -----Original Message-----
> >     > > > From: Jiayuan Chen <hamt...@gmail.com>
> >     > > > Sent: Monday, December 10, 2018 2:30 PM
> >     > > > To: dev@parquet.apache.org
> >     > > > Subject: parquet-arrow estimate file size
> >     > > >
> >     > > > Hello,
> >     > > >
> >     > > > I am a Parquet developer in the Bay Area, and I am writing this
> >     > > > email to seek help with writing Parquet files from Arrow.
> >     > > >
> >     > > > My goal is to control the size (in bytes) of the output Parquet
> >     > > > file when writing from an existing Arrow table. I saw a reply from
> >     > > > 2017 on this StackOverflow post (
> >     > > > https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
> >     > > > )
> >     > > > and am wondering if the following implementation is currently
> >     > > > possible: feed data into the Arrow table until the buffered data
> >     > > > can be converted to a Parquet file of a target size (e.g. 256 MB,
> >     > > > instead of a fixed number of rows), and then use WriteTable() to
> >     > > > create that Parquet file.
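> >     > > >
> >     > > > (As a rough illustration of that flow in pyarrow; the names and the
> >     > > > threshold below are made up, and the in-memory Arrow size is only a
> >     > > > proxy, since the encoded Parquet size will differ once compression
> >     > > > and dictionary encoding are applied.)
> >     > > >
> >     > > > import pyarrow as pa
> >     > > > import pyarrow.parquet as pq
> >     > > >
> >     > > > TARGET = 256 * 1024 * 1024      # desired size, via an in-memory proxy
> >     > > > buffered, buffered_bytes, part = [], 0, 0
> >     > > >
> >     > > > def incoming_batches():
> >     > > >     # Stand-in for the real streaming data source.
> >     > > >     for start in range(0, 5000, 1000):
> >     > > >         yield pa.record_batch({'x': list(range(start, start + 1000))})
> >     > > >
> >     > > > for batch in incoming_batches():
> >     > > >     buffered.append(batch)
> >     > > >     buffered_bytes += batch.nbytes
> >     > > >     if buffered_bytes >= TARGET:
> >     > > >         # Flush the buffered batches as one table, then start over.
> >     > > >         pq.write_table(pa.Table.from_batches(buffered), 'part-%d.parquet' % part)
> >     > > >         buffered, buffered_bytes, part = [], 0, part + 1
> >     > > >
> >     > > > if buffered:
> >     > > >     pq.write_table(pa.Table.from_batches(buffered), 'part-%d.parquet' % part)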
> >     > > >
> >     > > > I saw that parquet-cpp recently introduced an API to control the
> >     > > > column writer's size in bytes in the low-level API, but it seems
> >     > > > this is not yet available in the arrow-parquet API. Would this be
> >     > > > on the roadmap?
> >     > > >
> >     > > > Thanks,
> >     > > > Jiayuan
> >     > > >
> >     > > >
> >     >
> >
> >
>
