RE: Concatenation of parquet files

2021-10-15 Thread Lee, David
Here was my solution back in 2018.. It's easier to do now with pyarrow's python 
APIs than Spark..

https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file/51216145#51216145

Read all the smaller files in your list one at a time and write each one to the 
temp file as a parquet ROW GROUP. It is very important to write each file in as a 
row group, which preserves compression encoding and guarantees that the number of 
bytes written (minus schema metadata) will be the same as the original file 
size.
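
A minimal sketch of that approach with pyarrow (the file names here are 
placeholders, and it assumes the small files all share one schema; pyarrow may 
also split a very large table into more than one row group unless 
row_group_size is raised):

import pyarrow.parquet as pq

small_files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]  # hypothetical inputs
schema = pq.read_schema(small_files[0])
with pq.ParquetWriter("merged.parquet", schema=schema) as writer:
    for path in small_files:
        # Each small file is written into the merged file as its own row group,
        # so its encoded/compressed size carries over essentially unchanged.
        writer.write_table(pq.read_table(path))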

-Original Message-
From: Lee, David 
Sent: Friday, October 15, 2021 2:04 PM
To: dev@parquet.apache.org; 'emkornfi...@gmail.com' ; 
david@blackrock.com.invalid
Subject: RE: Concatenation of parquet files

Well this is right and wrong.. There is one footer, but the statistics are 
captured per row group, which allows row groups to be easily concatenated into a 
new file without rebuilding column stats.

The final file looks more like:

> > > ROW GROUP A1
> > > ROW GROUP A2
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > FOOTER A1, A2, B1, B2

http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/

When all the row groups are written, and before closing the file, the Parquet 
writer adds the footer to the end of the file.

The footer includes the file schema (column names and their types) as well as 
details about every row group (total size, number of rows, min/max statistics, 
number of NULL values for every column). 

Note that these column statistics are per row group, not for the entire file.
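
You can see that per-row-group layout through pyarrow's footer metadata; a small 
sketch (the file name is a placeholder, and statistics may be None if the writer 
skipped them):

import pyarrow.parquet as pq

md = pq.ParquetFile("merged.parquet").metadata
print(md.num_row_groups)                # the single footer lists every row group
rg = md.row_group(0)
col = rg.column(0)                      # first column chunk of the first row group
print(rg.num_rows, rg.total_byte_size)
print(col.statistics.min, col.statistics.max, col.statistics.null_count)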

-Original Message-
From: Micah Kornfield 
Sent: Friday, October 15, 2021 1:40 PM
To: david@blackrock.com.invalid
Cc: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files

External Email: Use caution with links and attachments


Hi David,
I'm not sure I understand.  Concatenating files like this would likely break 
things.  In particular in the example:


> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2


There should only be one footer per file, otherwise, I don't think there is any 
means of discovering the A row groups.  Also, without rewriting metadata file 
offsets of B would be wrong ( 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790 ).

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"


Multiple row groups are fine.  Combining them after the fact by simple file 
concatenation (which is what I understand the original question to be) would 
yield incorrect results.  If you reread small files and write them out again in 
one pass, that would be fine.

Cheers,
Micah

On Fri, Oct 15, 2021 at 1:29 PM Lee, David 
wrote:

> Each row group should have its own statistics footer or dictionary.. 
> Your file structure should look like this:
>
> > > *contents of parquet file A:*
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > >
> > > *contents of parquet file B:*
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> I frequently concatenate smaller parquet files by appending row groups 
> until I hit an optimal 125 meg file size for HDFS.
>
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"
>
> -Original Message-
> From: Pau Tallada 
> Sent: Tuesday, September 14, 2021 6:01 AM
> To: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> External Email: Use caution with links and attachments
>
>
> Dear Gabor,
>
> Thanks a lot for the clarification! ☺
> I understand this is not a common use case, I somewhat just had hoped 
> it could be done easily :P
>
> If you are interested, I attach a Colab notebook that shows this 
> behaviour.

RE: Concatenation of parquet files

2021-10-15 Thread Lee, David
Well this is right and wrong.. There is one footer, but the statistics are 
captured per row group, which allows row groups to be easily concatenated into a 
new file without rebuilding column stats.

The final file looks more like:

> > > ROW GROUP A1
> > > ROW GROUP A2
> > > ROW GROUP B1
> > > ROW GROUP B2
> > > FOOTER A1, A2, B1, B2

http://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/

When all the row groups are written, and before closing the file, the Parquet 
writer adds the footer to the end of the file.

The footer includes the file schema (column names and their types) as well as 
details about every row group (total size, number of rows, min/max statistics, 
number of NULL values for every column). 

Note that these column statistics are per row group, not for the entire file.

-Original Message-
From: Micah Kornfield  
Sent: Friday, October 15, 2021 1:40 PM
To: david@blackrock.com.invalid
Cc: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files

External Email: Use caution with links and attachments


Hi David,
I'm not sure I understand.  Concatenating files like this would likely break 
things.  In particular in the example:


> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2


There should only be one footer per file, otherwise, I don't think there is any 
means of discovering the A row groups.  Also, without rewriting metadata file 
offsets of B would be wrong ( 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L790 ).

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"


Multiple row groups are fine.  Combining them after the fact by simple file 
concatenation (which is what I understand the original question to be) would 
yield incorrect results.  If you reread small files and write them out again in 
one pass, that would be fine.

Cheers,
Micah

On Fri, Oct 15, 2021 at 1:29 PM Lee, David 
wrote:

> Each row group should have its own statistics footer or dictionary.. 
> Your file structure should look like this:
>
> > > *contents of parquet file A:*
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > >
> > > *contents of parquet file B:*
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> Merged:
> > > ROW GROUP A1
> > > FOOTER A1
> > > ROW GROUP A2
> > > FOOTER A2
> > > ROW GROUP B1
> > > FOOTER B1
> > > ROW GROUP B2
> > > FOOTER B2
>
> I frequently concatenate smaller parquet files by appending row groups 
> until I hit an optimal 125 meg file size for HDFS.
>
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> "We can similarly write a Parquet file with multiple row groups by 
> using ParquetWriter"
>
> -Original Message-
> From: Pau Tallada 
> Sent: Tuesday, September 14, 2021 6:01 AM
> To: dev@parquet.apache.org
> Subject: Re: Concatenation of parquet files
>
> External Email: Use caution with links and attachments
>
>
> Dear Gabor,
>
> Thanks a lot for the clarification! ☺
> I understand this is not a common use case, I somewhat just had hoped 
> it could be done easily :P
>
> If you are interested, I attach a Colab notebook that shows this 
> behaviour. The same data written three times produces different binary 
> contents.
>
> https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing
>
> Thanks again and best regards,
>
> Pau
>
> Missatge de Gabor Szadovszky  del dia dt., 14 de set.
> 2021 a les 10:54:
>
> > Hi Pau,
> >
> > I guess attachments are not allowed in the apache lists so we cannot 
> > see the image.
> >
> > If the two row groups contain the very same data in the same order 
> > and encoded with the same encoding, compressed with the same codec I 
>

RE: Concatenation of parquet files

2021-10-15 Thread Lee, David
Each row group should have its own statistics footer or dictionary.. Your file 
structure should look like this:

> > *contents of parquet file A:*
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> >
> > *contents of parquet file B:*
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

Merged:
> > ROW GROUP A1
> > FOOTER A1
> > ROW GROUP A2
> > FOOTER A2
> > ROW GROUP B1
> > FOOTER B1
> > ROW GROUP B2
> > FOOTER B2

I frequently concatenate smaller parquet files by appending row groups until I 
hit an optimal 125 meg file size for HDFS.

https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
"We can similarly write a Parquet file with multiple row groups by using 
ParquetWriter"

-Original Message-
From: Pau Tallada  
Sent: Tuesday, September 14, 2021 6:01 AM
To: dev@parquet.apache.org
Subject: Re: Concatenation of parquet files

External Email: Use caution with links and attachments


Dear Gabor,

Thanks a lot for the clarification! ☺
I understand this is not a common use case, I somewhat just had hoped it could 
be done easily :P

If you are interested, I attach a Colab notebook that shows this 
behaviour. The same data written three times produces different binary contents.
https://colab.research.google.com/drive/1z7VFeEagWk-YAfi4W1CioKUNh0OheQ9f?usp=sharing

Thanks again and best regards,

Pau

Missatge de Gabor Szadovszky  del dia dt., 14 de set.
2021 a les 10:54:

> Hi Pau,
>
> I guess attachments are not allowed in the apache lists so we cannot 
> see the image.
>
> If the two row groups contain the very same data in the same order, 
> encoded with the same encoding and compressed with the same codec, I 
> think they should be the same binary. I am not sure why you have 
> different binary streams for these row groups, but if the proper data 
> can be decoded from both row groups I would not spend too much time on it.
>
> About merging row groups: it is a tough issue and far from as simple 
> as concatenating the row groups (files) and creating a new footer. 
> There are statistics in the footer that you have to take care of, as 
> well as column indexes and bloom filters that are part of neither the 
> footer nor the row groups. (They are written in separate data 
> structures before the footer.)
> If you don't want to decode the row groups, these statistics can be 
> updated (with the new offsets) and the new footer can be created by 
> reading the original footers only. The problem here is that creating 
> such a parquet file is not very useful in most cases. Most of the 
> problems come from many small row groups (in small files), which 
> cannot be solved this way. To solve the small-files problem we need to 
> merge the row groups, and for that we need to decode the original data 
> so we can re-create the statistics (at least for bloom filters).
>
> Long story short, theoretically it is solvable but it is a feature we 
> haven't implemented properly so far.
>
> Cheers,
> Gabor
>
> On Tue, Sep 14, 2021 at 10:08 AM Pau Tallada  wrote:
>
> > Hi,
> >
> > I am a developer of cosmohub.pic.es, a science platform that 
> > provides interactive analysis and exploration of large scientific datasets.
> Working
> > with Hive, users are able to generate the subset of data they are 
> > interested in, and this result set is stored as a set of files. When
> users
> > want to download this dataset, we combine/concatenate all the files 
> > on-the-fly to generate a single stream that gets downloaded. Done 
> > right, this is very efficient, avoids materializing the combined 
> > file and the stream is even seekable so downloads can be resumed. We 
> > are able to do
> this
> > for csv.bz2 and FITS formats.
> >
> > I am trying to do the same with parquet. Looking at the format 
> > specification, it seems that it could be done by simply 
> > concatenating the binary blobs of the set of row groups and 
> > generating a new footer for the merged file. The problem is that the 
> > same data, written twice in the same file (in two row groups), is 
> > represented with some differences in the binary stream produced (see 
> > attached image). Why is the binary representation of a row group 
> > different if the data is the same? Is the order or position of a row group 
> > codified inside its metadata?
> >
> > I attach the image of a parquet file with the same data (a single 
> > integer column named 'c' with a single value 0) written twice, with 
> > at least two differences marked in red and blue.
> > [image: image.png]
> >
> >
> > A little diagram to show what I'm trying to accomplish:
> >
> > *contents of parquet file A:*
> > PAR1
> > ROW GROUP A1
> > ROW GROUP A2
> > FOOTER A
> >
> > *contents of parquet file B:*
> > PAR1
> > ROW GROUP B1
> > ROW GROUP B2
> > FOOTER B
> >
> > If I'm not mistaken, there is no metadata in each row 

RE: Parquet File Naming Convention Standards

2019-05-22 Thread Lee, David
I've tried the one row group per parquet file / block approach and I ran into a 
couple of problems, with some observations..

1. The single row group would contain 30 million rows x 10 columns of data. 
This requires a lot more memory to write the file. Saving 10 row groups one at 
a time into a single parquet file cuts the max memory usage down to 3 million 
rows at a time.

2. Dictionary encoding only works if the dictionary values do not exceed the 
reserved space in a parquet file. Each row group has its own reserved space for 
dictionary values. Once you exceed the reserved space, dictionary encoding 
isn't used, which can lead to slower query performance and increase the overall 
storage needed by 10% or more.

3. I generally try to store 30 million cells of data per row group: 3 million 
rows x 10 columns, 10 million rows x 3 columns, etc. (see the sketch below).
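
A minimal sketch of that rule of thumb with pyarrow's write_table (the table, 
column count, and file name are placeholders; row_group_size caps the rows 
written per row group):

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for a much larger 10-column table produced elsewhere.
table = pa.table({"c%d" % i: list(range(100_000)) for i in range(10)})

# Cap each row group at 3 million rows so rows x columns stays near a
# 30-million-cell budget; pyarrow splits the table into row groups accordingly.
pq.write_table(table, "big.parquet", row_group_size=3_000_000)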

-Original Message-
From: Tim Armstrong  
Sent: Wednesday, May 22, 2019 12:27 PM
To: Parquet Dev 
Subject: Re: Parquet File Naming Convention Standards

External Email: Use caution with links and attachments


Not reusing file names is generally a good idea - there are a bunch of 
interesting consistency issues, particularly on object stores, if you reuse 
file paths. This has come up for us with things like INSERT OVERWRITE in Hive, 
which tends to generate the same file names.

I think there's an interesting set of discussions to be had around best 
practices for file sizes and row group sizes.

One point is that a lot of big data frameworks schedule parallel work based on 
filesystem metadata only (i.e. file sizes and block sizes, if the filesystem 
has a concept of a block). If you have arbitrary parquet files this can break 
down in various ways - e.g. if you have a 1GB file, you have to guess what a 
good way to divide up the processing is. If there are fewer row groups than 
expected, you'll get skew and if there are more you'll lose out on parallelism. 
HDFS blocks were often a good way to do this, since a lot of writers aim for 
one row group per block, but Parquet files often come from a variety of sources 
and get munged in different ways, so the heuristic falls over in various ways 
in some applications. It's somewhat worse on object stores like S3, where there 
isn't a concept of a block, just whatever the writer and reader have configured 
- you really ideally want reader and writer block sizes to line up, but 
coordinating can be difficult for some workflows.

Working on Impala, I'm a bit biased towards larger blocks, because of the 
scheduling problems and also because of the extra overhead added with row 
groups - we end up needing to do extra I/O operations per row group, adding 
overhead (some of the overhead is inherent because the data you're reading is 
more fragmented; some of it is just our implementation).

On Wed, May 22, 2019 at 11:55 AM Brian Bowman  wrote:

>  Thanks for the info!
>
> HDFS is only one of many storage platforms (distributed or otherwise) 
> that SAS supports.  In general larger physical files (e.g. 100MB to 
> 1GB) with multiple RowGroups are also a good thing for our usage 
> cases.  I'm working to get our Parquet (C to C++ via libparquet.so) writer to 
> do this.
>
> -Brian
>
> On 5/22/19, 1:21 PM, "Lee, David"  wrote:
>
> EXTERNAL
>
> I'm not a big fan of this convention which is a Spark convention..
>
> A. The files should have at least "foo" in the name. Using PyArrow 
> I would create these files as foo.1.parquet, foo.2.parquet, etc..
> B. These files are around 3 megs each. For HDFS storage, files 
> should be sized to match the HDFS blocksize which is usually set at 
> 128 megs
> (default) or 256 megs, 512 megs, 1 gig, etc..
>
> 
> https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
> I usually take small parquet files and save them as parquet row 
> groups in a larger parquet file to match the HDFS blocksize.
>
> -Original Message-
> From: Brian Bowman 
> Sent: Wednesday, May 22, 2019 8:40 AM
> To: dev@parquet.apache.org
> Subject: Parquet File Naming Convention Standards
>
> External Email: Use caution with links and attachments
>
>
> All,
>
> Here is an example .parquet data set saved using pySpark where the 
> following files are members of directory: “foo.parquet”:
>
> -rw-r--r--1 sasbpb  r8 Mar 26 12:10 ._SUCCESS.crc
> -rw-r--r--1 sasbpb  r25632 Mar 26 12:10
> .part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
> -rw-r--r--1 sasbpb  r25356 Mar 26 12:10
> .part-1-b84abe50-a92b-4b2b-b011-3099

RE: Parquet File Naming Convention Standards

2019-05-22 Thread Lee, David
I'm not a big fan of this convention which is a Spark convention..

A. The files should have at least "foo" in the name. Using PyArrow I would 
create these files as foo.1.parquet, foo.2.parquet, etc..
B. These files are around 3 megs each. For HDFS storage, files should be sized 
to match the HDFS blocksize which is usually set at 128 megs (default) or 256 
megs, 512 megs, 1 gig, etc..

https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

I usually take small parquet files and save them as parquet row groups in a 
larger parquet file to match the HDFS blocksize.

-Original Message-
From: Brian Bowman  
Sent: Wednesday, May 22, 2019 8:40 AM
To: dev@parquet.apache.org
Subject: Parquet File Naming Convention Standards 

External Email: Use caution with links and attachments


All,

Here is an example .parquet data set saved using pySpark where the following 
files are members of directory: “foo.parquet”:

-rw-r--r--1 sasbpb  r8 Mar 26 12:10 ._SUCCESS.crc
-rw-r--r--1 sasbpb  r25632 Mar 26 12:10 
.part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r25356 Mar 26 12:10 
.part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r26300 Mar 26 12:10 
.part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r23728 Mar 26 12:10 
.part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet.crc
-rw-r--r--1 sasbpb  r0 Mar 26 12:10 _SUCCESS
-rw-r--r--1 sasbpb  r  3279617 Mar 26 12:10 
part-0-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3244105 Mar 26 12:10 
part-1-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3365039 Mar 26 12:10 
part-2-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet
-rw-r--r--1 sasbpb  r  3035960 Mar 26 12:10 
part-3-b84abe50-a92b-4b2b-b011-30990891fb83-c000.parquet


Questions:

  1.  Is this the “standard” for creating/saving a .parquet data set?
  2.  It appears that “b84abe50-a92b-4b2b-b011-30990891fb83” is a UUID.  Is the 
format:
 part-fileSeq#-UUID.parquet or part-fileSeq#-UUID.parquet.crc an 
established convention?  Is this documented somewhere?
  3.  Is there a C++ class to create the CRC?


Thanks,


Brian




RE: parquet-arrow estimate file size

2018-12-11 Thread Lee, David
In my experience and experiments it is really hard to approximate target sizes. 
A single parquet file with a single row group could be 20% larger than a 
parquet file with 20 row groups, because if you have a lot of rows with a lot 
of data variety you can lose dictionary encoding options. I predetermine my row 
group sizes by creating them as files and then writing them to a single parquet 
file.

A better approach would probably be to write the row groups to a single file 
and, once the size exceeds your target size, remove the last row group written 
and start a new file with it, but I don't think there is a method to remove a 
row group right now.

Another option would be to write the row group out as a file object in memory 
to predetermine its size before adding it as a row group in a parquet file.
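
A sketch of that in-memory option with pyarrow (the names and the 128 MB 
threshold are placeholders): serialize the candidate row group to a buffer 
first, measure it, then decide whether to write it out.

import pyarrow as pa
import pyarrow.parquet as pq

def parquet_size_in_bytes(table):
    # Write the table to an in-memory Parquet stream purely to measure its encoded size.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)
    return sink.getvalue().size

chunk = pa.table({"c": list(range(1_000_000))})   # hypothetical candidate row group
if parquet_size_in_bytes(chunk) <= 128 * 1024 * 1024:
    with pq.ParquetWriter("out.parquet", schema=chunk.schema) as writer:
        writer.write_table(chunk)                 # write it out as a row group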


-Original Message-
From: Wes McKinney  
Sent: Tuesday, December 11, 2018 7:16 AM
To: Parquet Dev 
Subject: Re: parquet-arrow estimate file size

External Email: Use caution with links and attachments


hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to 
control or examine the size of files as they are being written.
Ideally we would develop an interface to write a sequence of arrow::RecordBatch 
objects that would automatically move on to a new file once a certain 
approximate target size has been reached in an existing file. There's a number 
of moving parts that would need to be created to make this possible.

- Wes
On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal  wrote:
>
> I think if I've understood the problem correctly, you could use the 
> parquet::arrow::FileWriter
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
>
> The basic pattern is to use an object to manage the FileWriter lifetime, call 
> the WriteTable method for each row group, and close it when you are done.  My 
> understanding is that each call to WriteTable will append a new row group 
> which should allow you to incrementally write an out-of-memory dataset.  I 
> realize now that I haven't tested this myself so it would be good to 
> double-check this with someone more experienced with the parquet-cpp APIs.
>
> On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:
>
> Thanks for the suggestion, will do.
>
> Since such high-level API is not yet implemented in the parquet-cpp
> project, I have to turn back to use the API newly introduced in the
> low-level API, that calculates the Parquet file size when adding data into
> the column writers. I have another question on that part:
>
Is there any sample code & advice that I can follow to be able to stream
the Parquet file on a per-row-group basis? In other words, to restrict
memory usage but still create a big enough Parquet file, I would like to
create relatively small row groups in memory using InMemoryOutputStream(),
and dump the buffer contents to my external stream, after completing each
row group, until a big file with several row groups is finished. However, my
attempts to manipulate the underlying arrow::Buffer have failed, in that the
pages starting from the second row group are unreadable.
>
> Thanks!
>
> On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:
>
> > hi Jiayuan,
> >
> > To your question
> >
> > > Would this be in the roadmap?
> >
> > I doubt there would be any objections to adding this feature to the
> > Arrow writer API -- please feel free to open a JIRA issue to describe
> > how the API might work in C++. Note there is no formal roadmap in this
> > project.
> >
> > - Wes
> > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen  wrote:
> > >
> > > Thanks for the Python solution. However, is there a solution in C++ 
> that
> > I
> > > can create such Parquet file with only in-memory buffer, using
> > parquet-cpp
> > > library?
> > >
> > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David 
> > wrote:
> > >
> > > > Resending.. Somehow I lost some line feeds in the previous reply..
> > > >
> > > > import os
> > > > import pyarrow.parquet as pq
> > > > import glob as glob
> > > >
> > > > max_target_size = 134217728
> > > > target_size = max_target_size * .95
> > > > # Directory where parquet files are sa

RE: parquet-arrow estimate file size

2018-12-10 Thread Lee, David
Resending.. Somehow I lost some line feeds in the previous reply..

import os
import pyarrow.parquet as pq
import glob as glob

max_target_size = 134217728
target_size = max_target_size * .95
# Directory where parquet files are saved
working_directory = '/tmp/test'
files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)
print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)
        # Just rename file if the file size is in target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue
        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 megs
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v
        # Just rename file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue
        # Merge smaller files into one large file. Read row groups from each
        # file and add them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema,
                                  use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)


-Original Message-
From: Jiayuan Chen  
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size

External Email: Use caution with links and attachments


Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to seek 
precious help on writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when 
writing from an existing arrow table. I saw a reply in 2017 on this StackOverflow 
post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
) and was wondering if the following implementation is currently possible: feed 
data into the Arrow table until the point that the buffered data can be converted 
to a Parquet file (e.g. of size 256 MB, instead of a fixed number of rows), and 
then use WriteTable() to create such a Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's 
size in bytes in the low-level API, but it seems this is still not yet available 
for the arrow-parquet API. Would this be on the roadmap?

Thanks,
Jiayuan




RE: parquet-arrow estimate file size

2018-12-10 Thread Lee, David
Here's some sample code:

import os
import pyarrow.parquet as pq
import glob as glob

max_target_size = 134217728
target_size = max_target_size * .95
# Directory where parquet files are saved
working_directory = '/tmp/test'
files_dict = dict()
files = glob.glob(os.path.join(working_directory, "*.parquet"))
files.sort()
for file in files:
    files_dict[file] = os.path.getsize(file)
print("Merging parquet files")
temp_file = os.path.join(working_directory, "temp.parquet")
file_no = 0
for file in files:
    if file in files_dict:
        file_no = file_no + 1
        file_name = os.path.join(working_directory, str(file_no).zfill(4) + ".parquet")
        print("Saving to parquet file " + file_name)
        # Just rename file if the file size is in target range
        if files_dict[file] > target_size:
            del files_dict[file]
            os.rename(file, file_name)
            continue
        merge_list = list()
        file_size = 0
        # Find files to merge together which add up to less than 128 megs
        for k, v in files_dict.items():
            if file_size + v <= max_target_size:
                print("Adding file " + k + " to merge list")
                merge_list.append(k)
                file_size = file_size + v
        # Just rename file if there is only one file to merge
        if len(merge_list) == 1:
            del files_dict[merge_list[0]]
            os.rename(merge_list[0], file_name)
            continue
        # Merge smaller files into one large file. Read row groups from each
        # file and add them to the new file.
        schema = pq.read_schema(file)
        print("Saving to new parquet file")
        writer = pq.ParquetWriter(temp_file, schema=schema,
                                  use_dictionary=True, compression='snappy')
        for merge in merge_list:
            parquet_file = pq.ParquetFile(merge)
            print("Writing " + merge + " to new parquet file")
            for i in range(parquet_file.num_row_groups):
                writer.write_table(parquet_file.read_row_group(i))
            del files_dict[merge]
            os.remove(merge)
        writer.close()
        os.rename(temp_file, file_name)

-Original Message-
From: Jiayuan Chen  
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size

External Email: Use caution with links and attachments


Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to seek 
precious help on writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when 
writing from an existing arrow table. I saw a reply in 2017 on this StackOverflow 
post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
) and was wondering if the following implementation is currently possible: feed 
data into the Arrow table until the point that the buffered data can be converted 
to a Parquet file (e.g. of size 256 MB, instead of a fixed number of rows), and 
then use WriteTable() to create such a Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's 
size in bytes in the low-level API, but it seems this is still not yet available 
for the arrow-parquet API. Would this be on the roadmap?

Thanks,
Jiayuan




RE: parquet-arrow estimate file size

2018-12-10 Thread Lee, David
Here's my comment and how I'm generating 128 meg parquet files. This takes 
into account file sizes after compression and dictionary encoding.

https://issues.apache.org/jira/browse/ARROW-3728?focusedCommentId=16703544&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16703544

It would be nice to have a merge() function for parquet files that does something 
similar to create parquet files which match HDFS block sizes.
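
Purely as an illustration of what such a helper could look like with pyarrow 
(the function name, 128 MB target, and output naming are assumptions, and file 
sizes are estimated from the inputs rather than the bytes actually written):

import os
import pyarrow.parquet as pq

def merge_small_files(paths, out_prefix, target_size=128 * 1024 * 1024):
    # Hypothetical helper: copy row groups from small files into ~target_size outputs.
    writer, part, written = None, 0, 0
    for path in sorted(paths):
        if writer is None or written >= target_size:
            if writer is not None:
                writer.close()
            part += 1
            written = 0
            writer = pq.ParquetWriter("%s.%d.parquet" % (out_prefix, part),
                                      schema=pq.read_schema(path))
        pf = pq.ParquetFile(path)
        for i in range(pf.num_row_groups):
            writer.write_table(pf.read_row_group(i))
        written += os.path.getsize(path)  # row groups keep roughly their original size
    if writer is not None:
        writer.close()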


-Original Message-
From: Jiayuan Chen  
Sent: Monday, December 10, 2018 2:30 PM
To: dev@parquet.apache.org
Subject: parquet-arrow estimate file size

External Email: Use caution with links and attachments


Hello,

I am a Parquet developer in the Bay Area, and I am writing this email to seek 
precious help on writing Parquet files from Arrow.

My goal is to control the size (in bytes) of the output Parquet file when 
writing from an existing arrow table. I saw a reply in 2017 on this StackOverflow 
post (
https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
) and was wondering if the following implementation is currently possible: feed 
data into the Arrow table until the point that the buffered data can be converted 
to a Parquet file (e.g. of size 256 MB, instead of a fixed number of rows), and 
then use WriteTable() to create such a Parquet file.

I saw that parquet-cpp recently introduced an API to control the column writer's 
size in bytes in the low-level API, but it seems this is still not yet available 
for the arrow-parquet API. Would this be on the roadmap?

Thanks,
Jiayuan

