Re: parquet-arrow estimate file size

2018-12-11 Thread Jiayuan Chen
So it seems there is no way to implement such a mechanism using the
low-level API? I tried to dump the arrow::Buffer after each row group is
completed, but it looks like that is not a clean cut: the pages starting
from the second row group became unreadable (the schema is correct, though).

If such a solution does not exist, I will go back to the high-level API
that uses an in-memory Arrow table.




On Tue, Dec 11, 2018 at 8:17 AM Lee, David  wrote:

> In my experience and experiments it is really hard to approximate target
> sizes. A single parquet file with a single row group could be 20% larger
> than a parquet file with 20 row groups, because if you have a lot of rows
> with a lot of data variety you can lose dictionary encoding options. I
> predetermine my row group sizes by creating them as files and then write
> them to a single parquet file.
>
> A better approach would probably be to write row groups to a single file
> and, once the size exceeds your target, remove the last row group written
> and start a new file with it, but I don't think there is a method to
> remove a row group right now.
>
> Another option would be to write the row group out as a file object in
> memory to predetermine its size before adding it as a row group in a
> parquet file.
>
>
> -Original Message-
> From: Wes McKinney 
> Sent: Tuesday, December 11, 2018 7:16 AM
> To: Parquet Dev 
> Subject: Re: parquet-arrow estimate file size
>
> hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to
> control or examine the size of files as they are being written.
> Ideally we would develop an interface to write a sequence of
> arrow::RecordBatch objects that would automatically move on to a new file
> once a certain approximate target size has been reached in an existing
> file. There are a number of moving parts that would need to be created to
> make this possible.
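>
> As a purely hypothetical sketch of what such an interface could look like
> (none of these names exist in the library; 2018-era Status-style
> signatures are assumed and may differ across releases; untested):
>
> #include <sstream>
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/writer.h>
>
> // Hypothetical "rolling" writer: append tables (one row group each) and
> // roll over to a new file once the bytes already written pass a target.
> class RollingWriter {
>  public:
>   RollingWriter(std::shared_ptr<arrow::Schema> schema, std::string prefix,
>                 int64_t target_bytes)
>       : schema_(std::move(schema)), prefix_(std::move(prefix)),
>         target_bytes_(target_bytes) {}
>
>   arrow::Status Append(const arrow::Table& table) {
>     if (!writer_) ARROW_RETURN_NOT_OK(OpenNextFile());
>     // chunk_size = num_rows, so each Append becomes one row group.
>     ARROW_RETURN_NOT_OK(writer_->WriteTable(table, table.num_rows()));
>     int64_t position = 0;
>     ARROW_RETURN_NOT_OK(sink_->Tell(&position));
>     // Approximate: pages still buffered in the writer have not reached
>     // the sink yet.
>     if (position >= target_bytes_) ARROW_RETURN_NOT_OK(Finish());
>     return arrow::Status::OK();
>   }
>
>   // Close the current file (writes the footer) so it becomes readable.
>   arrow::Status Finish() {
>     if (writer_) {
>       ARROW_RETURN_NOT_OK(writer_->Close());
>       ARROW_RETURN_NOT_OK(sink_->Close());
>       writer_.reset();
>     }
>     return arrow::Status::OK();
>   }
>
>  private:
>   arrow::Status OpenNextFile() {
>     std::ostringstream path;
>     path << prefix_ << "-" << file_index_++ << ".parquet";
>     ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path.str(), &sink_));
>     return parquet::arrow::FileWriter::Open(
>         *schema_, arrow::default_memory_pool(), sink_,
>         parquet::default_writer_properties(), &writer_);
>   }
>
>   std::shared_ptr<arrow::Schema> schema_;
>   std::string prefix_;
>   int64_t target_bytes_;
>   int file_index_ = 0;
>   std::shared_ptr<arrow::io::FileOutputStream> sink_;
>   std::unique_ptr<parquet::arrow::FileWriter> writer_;
> };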
>
> - Wes
> On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal  wrote:
> >
> > I think if I've understood the problem correctly, you could use the
> > parquet::arrow::FileWriter
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
> >
> > The basic pattern is to use an object to manage the FileWriter lifetime,
> > call the WriteTable method for each row group, and close it when you are
> > done.  My understanding is that each call to WriteTable will append a new
> > row group, which should allow you to incrementally write a dataset that
> > does not fit in memory.  I realize now that I haven't tested this myself,
> > so it would be good to double-check this with someone more experienced
> > with the parquet-cpp APIs.
> >
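> > A minimal sketch of that pattern (2018-era Status-style signatures
> > assumed; WriteInRowGroups is just an illustrative name; untested):
> >
> > #include <memory>
> > #include <string>
> > #include <vector>
> > #include <arrow/api.h>
> > #include <arrow/io/api.h>
> > #include <parquet/arrow/writer.h>
> >
> > // Open the FileWriter once, call WriteTable once per row group, and
> > // Close at the end, so only one piece is materialized at a time.
> > arrow::Status WriteInRowGroups(
> >     const std::vector<std::shared_ptr<arrow::Table>>& pieces,
> >     const std::shared_ptr<arrow::Schema>& schema, const std::string& path) {
> >   std::shared_ptr<arrow::io::FileOutputStream> sink;
> >   ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &sink));
> >
> >   std::unique_ptr<parquet::arrow::FileWriter> writer;
> >   ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
> >       *schema, arrow::default_memory_pool(), sink,
> >       parquet::default_writer_properties(), &writer));
> >
> >   for (const auto& piece : pieces) {
> >     // Each WriteTable call appends at least one new row group.
> >     ARROW_RETURN_NOT_OK(writer->WriteTable(*piece, piece->num_rows()));
> >   }
> >
> >   ARROW_RETURN_NOT_OK(writer->Close());
> >   return sink->Close();
> > }
> >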
> > On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:
> >
> > Thanks for the suggestion, will do.
> >
> > Since such a high-level API is not yet implemented in the parquet-cpp
> > project, I have to fall back to the newly introduced low-level API that
> > calculates the Parquet file size as data is added to the column writers.
> > I have another question on that part:
> >
> > Is there any sample code or advice I can follow to stream the Parquet
> > file on a per-row-group basis? In other words, to restrict memory usage
> > but still create a big enough Parquet file, I would like to create
> > relatively small row groups in memory using InMemoryOutputStream(), and
> > dump the buffer contents to my external stream after completing each row
> > group, until a big file with several row groups is finished. However, my
> > attempts to manipulate the underlying arrow::Buffer have failed: the
> > pages starting from the second row group are unreadable.
> >
> > Thanks!
> >
> > On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:
> >
> > > hi Jiayuan,
> > >
> > > To your question
> > >
> > > > Would this be in the roadmap?
> > >
> > > I doubt there would be any objections to adding this feature to the
> > > Arrow writer API -- please feel free to open a JIRA issue to describe
> > > how the API might work in C++. Note there is no formal roadmap in this
> > > project.
> > >
> > > - Wes
> >  

Re: parquet-arrow estimate file size

2018-12-10 Thread Jiayuan Chen
Thanks for the suggestion, will do.

Since such a high-level API is not yet implemented in the parquet-cpp
project, I have to fall back to the newly introduced low-level API that
calculates the Parquet file size as data is added to the column writers. I
have another question on that part:

Is there any sample code or advice I can follow to stream the Parquet file
on a per-row-group basis? In other words, to restrict memory usage but
still create a big enough Parquet file, I would like to create relatively
small row groups in memory using InMemoryOutputStream(), and dump the
buffer contents to my external stream after completing each row group,
until a big file with several row groups is finished. However, my attempts
to manipulate the underlying arrow::Buffer have failed: the pages starting
from the second row group are unreadable.

Thanks!
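
For reference, a rough sketch of writing several row groups with the
low-level API, modeled on the parquet-cpp low-level reader-writer example
(the single required INT64 column and the file name are illustrative only;
2018-era Status-style signatures assumed; untested). One caveat: the footer
is only written by Close(), so the byte ranges belonging to individual row
groups are not standalone Parquet files.

#include <memory>
#include <arrow/io/api.h>
#include <parquet/api/writer.h>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Schema with a single required int64 column.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make(
      "int64_field", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_THROW_NOT_OK(
      arrow::io::FileOutputStream::Open("several_row_groups.parquet", &sink));

  std::shared_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(sink, schema);

  for (int rg = 0; rg < 4; ++rg) {
    parquet::RowGroupWriter* rg_writer = writer->AppendRowGroup();
    auto* int64_writer =
        static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());
    for (int64_t i = 0; i < 1000; ++i) {
      int64_t value = rg * 1000 + i;
      int64_writer->WriteBatch(1, nullptr, nullptr, &value);
    }
    // Closing the row group flushes its pages, bounding memory usage.
    rg_writer->Close();
  }

  writer->Close();  // writes the footer; the file is only valid after this
  PARQUET_THROW_NOT_OK(sink->Close());
  return 0;
}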

On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:

> hi Jiayuan,
>
> To your question
>
> > Would this be in the roadmap?
>
> I doubt there would be any objections to adding this feature to the
> Arrow writer API -- please feel free to open a JIRA issue to describe
> how the API might work in C++. Note there is no formal roadmap in this
> project.
>
> - Wes
> On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen  wrote:
> >
> > Thanks for the Python solution. However, is there a solution in C++ that
> > lets me create such a Parquet file using only an in-memory buffer, with
> > the parquet-cpp library?
> >
> > On Mon, Dec 10, 2018 at 3:23 PM Lee, David  wrote:
> >
> > > Resending.. Somehow I lost some line feeds in the previous reply..
> > >
> > > import os
> > > import pyarrow.parquet as pq
> > > import glob as glob
> > >
> > > max_target_size = 134217728
> > > target_size = max_target_size * .95
> > > # Directory where parquet files are saved
> > > working_directory = '/tmp/test'
> > > files_dict = dict()
> > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > files.sort()
> > > for file in files:
> > >     files_dict[file] = os.path.getsize(file)
> > > print("Merging parquet files")
> > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > file_no = 0
> > > for file in files:
> > >     if file in files_dict:
> > >         file_no = file_no + 1
> > >         file_name = os.path.join(working_directory,
> > >                                  str(file_no).zfill(4) + ".parquet")
> > >         print("Saving to parquet file " + file_name)
> > >         # Just rename file if the file size is in target range
> > >         if files_dict[file] > target_size:
> > >             del files_dict[file]
> > >             os.rename(file, file_name)
> > >             continue
> > >         merge_list = list()
> > >         file_size = 0
> > >         # Find files to merge together which add up to less than 128 megs
> > >         for k, v in files_dict.items():
> > >             if file_size + v <= max_target_size:
> > >                 print("Adding file " + k + " to merge list")
> > >                 merge_list.append(k)
> > >                 file_size = file_size + v
> > >         # Just rename file if there is only one file to merge
> > >         if len(merge_list) == 1:
> > >             del files_dict[merge_list[0]]
> > >             os.rename(merge_list[0], file_name)
> > >             continue
> > >         # Merge smaller files into one large file. Read row groups from
> > >         # each file and add them to the new file.
> > >         schema = pq.read_schema(file)
> > >         print("Saving to new parquet file")
> > >         writer = pq.ParquetWriter(temp_file, schema=schema,
> > >                                   use_dictionary=True, compression='snappy')
> > >         for merge in merge_list:
> > >             parquet_file = pq.ParquetFile(merge)
> > >             print("Writing " + merge + " to new parquet file")
> > >             for i in range(parquet_file.num_row_groups):
> > >                 writer.write_table(parquet_file.read_row_group(i))
> > >             del files_dict[merge]
> > >             os.remove(merge)
> > >         writer.close()
> > >         os.rename(temp_file, file_name)
> > >
> > >
> > > -Original Message-
> > > From: Jiayuan Chen 
> > > Sent: Monday, December 10, 2018 2:30 PM
> > > To: de

Re: parquet-arrow estimate file size

2018-12-10 Thread Jiayuan Chen
Thanks for the Python solution. However, is there a solution in C++ that
lets me create such a Parquet file using only an in-memory buffer, with the
parquet-cpp library?

On Mon, Dec 10, 2018 at 3:23 PM Lee, David  wrote:

> Resending.. Somehow I lost some line feeds in the previous reply..
>
> import os
> import pyarrow.parquet as pq
> import glob as glob
>
> max_target_size = 134217728
> target_size = max_target_size * .95
> # Directory where parquet files are saved
> working_directory = '/tmp/test'
> files_dict = dict()
> files = glob.glob(os.path.join(working_directory, "*.parquet"))
> files.sort()
> for file in files:
>     files_dict[file] = os.path.getsize(file)
> print("Merging parquet files")
> temp_file = os.path.join(working_directory, "temp.parquet")
> file_no = 0
> for file in files:
>     if file in files_dict:
>         file_no = file_no + 1
>         file_name = os.path.join(working_directory,
>                                  str(file_no).zfill(4) + ".parquet")
>         print("Saving to parquet file " + file_name)
>         # Just rename file if the file size is in target range
>         if files_dict[file] > target_size:
>             del files_dict[file]
>             os.rename(file, file_name)
>             continue
>         merge_list = list()
>         file_size = 0
>         # Find files to merge together which add up to less than 128 megs
>         for k, v in files_dict.items():
>             if file_size + v <= max_target_size:
>                 print("Adding file " + k + " to merge list")
>                 merge_list.append(k)
>                 file_size = file_size + v
>         # Just rename file if there is only one file to merge
>         if len(merge_list) == 1:
>             del files_dict[merge_list[0]]
>             os.rename(merge_list[0], file_name)
>             continue
>         # Merge smaller files into one large file. Read row groups from
>         # each file and add them to the new file.
>         schema = pq.read_schema(file)
>         print("Saving to new parquet file")
>         writer = pq.ParquetWriter(temp_file, schema=schema,
>                                   use_dictionary=True, compression='snappy')
>         for merge in merge_list:
>             parquet_file = pq.ParquetFile(merge)
>             print("Writing " + merge + " to new parquet file")
>             for i in range(parquet_file.num_row_groups):
>                 writer.write_table(parquet_file.read_row_group(i))
>             del files_dict[merge]
>             os.remove(merge)
>         writer.close()
>         os.rename(temp_file, file_name)
>
>
> -Original Message-
> From: Jiayuan Chen 
> Sent: Monday, December 10, 2018 2:30 PM
> To: dev@parquet.apache.org
> Subject: parquet-arrow estimate file size
>
> Hello,
>
> I am a Parquet developer in the Bay Area, and I am writing this email to
> seek help with writing Parquet files from Arrow.
>
> My goal is to control the size (in bytes) of the output Parquet file when
> writing from an existing Arrow table. I saw a reply from 2017 on this
> StackOverflow post (
> https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
> )
> and am wondering whether the following implementation is currently
> possible: feed data into an Arrow table until the buffered data can be
> converted to a Parquet file of a target size (e.g. 256 MB, instead of a
> fixed number of rows), and then use WriteTable() to create that Parquet
> file.
>
> I saw that parquet-cpp recently introduced an API to control the column
> writer's size in bytes in the low-level interface, but it seems this is
> not yet available through the arrow-parquet API. Would this be on the
> roadmap?
>
> Thanks,
> Jiayuan
>


parquet-arrow estimate file size

2018-12-10 Thread Jiayuan Chen
Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to
seek help with writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when
writing from an existing Arrow table. I saw a reply from 2017 on this
StackOverflow post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering)
and am wondering whether the following implementation is currently possible:
feed data into an Arrow table until the buffered data can be converted to a
Parquet file of a target size (e.g. 256 MB, instead of a fixed number of
rows), and then use WriteTable() to create that Parquet file.
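
A minimal sketch of that workflow, assuming the 2018-era Status-style
signatures (untested); FlushBatches is just an illustrative name, and the
trigger for flushing is left to the application, since the final file size
depends on encoding and compression:

#include <string>
#include <vector>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Convert the buffered record batches to a Table and write it out as one
// Parquet file; chunk_size controls how many rows go into each row group.
arrow::Status FlushBatches(
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
    const std::string& path) {
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(arrow::Table::FromRecordBatches(batches, &table));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &sink));

  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1 << 20));
  return sink->Close();
}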

I saw that parquet-cpp recently introduced an API to control the column
writer's size in bytes in the low-level interface, but it seems this is not
yet available through the arrow-parquet API. Would this be on the roadmap?

Thanks,
Jiayuan