[
https://issues.apache.org/jira/browse/ARROW-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Remek Zajac updated ARROW-4542:
-------------------------------
Description:
Both the C++ [implementation of the parquet writer for
arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1174]
and the [Python code bound to
it|https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L911]
appear to denominate the row group size in the *number of rows* (without making
this very explicit).
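To make this concrete, here is a minimal pyarrow sketch of the behaviour as I
read the linked code (the file name and figures are only illustrative):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# Asks for row groups of 100,000 *rows* each - the byte size each group
# ends up occupying on disk depends on the data and on compression.
pq.write_table(table, "example.parquet", row_group_size=100_000)

meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups)                # roughly 10 groups
print(meta.row_group(0).total_byte_size)  # bytes per group vary with the data
{code}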
Whereas:
(1) [The Apache parquet
documentation|https://parquet.apache.org/documentation/latest/] states:
"_Row group size: Larger row groups allow for larger column chunks which makes
it possible to do larger sequential IO. Larger groups also require more
buffering in the write path (or a two pass write). *We recommend large row
groups (512MB - 1GB)*. Since an entire row group might need to be read, we want
it to completely fit on one HDFS block. Therefore, HDFS block sizes should also
be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS
block size, 1 HDFS block per HDFS file._"
(2) The reference Apache [parquet-mr
implementation|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L146]
for Java accepts the row group size expressed in bytes.
(3) The [low-level parquet read-write
example|https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer2.cc#L88]
also treats the row group size as denominated in bytes.
These insights lead me to conclude that:
* Per parquet's design, and to take advantage of HDFS block-level operations,
it only makes sense to work with row group sizes expressed in bytes - as that
is the only consequential preference the caller can state and want to
influence.
* The Arrow implementation of ParquetWriter would benefit from re-denominating
its `row_group_size` in bytes. I will also note that it is currently impossible
to use pyarrow to shape row groups of equal byte size, because the size a row
group ends up occupying is post-compression and the caller only knows how much
uncompressed data they have managed to put in (see the sketch after this list).
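For completeness, a rough work-around sketch (the function name
write_with_byte_target and the numbers are mine, not part of pyarrow): the best
a caller can do today is estimate a rows-per-group figure from the uncompressed
size, which still does not yield equally byte-sized groups once encoding and
compression are applied.
{code:python}
import pyarrow.parquet as pq

def write_with_byte_target(table, path, target_bytes=512 * 1024 * 1024):
    # Average *uncompressed* bytes per row, used to convert the byte budget
    # into a row count - an estimate, not the eventual on-disk size.
    avg_row_bytes = table.nbytes / max(table.num_rows, 1)
    rows_per_group = max(int(target_bytes / avg_row_bytes), 1)
    pq.write_table(table, path, row_group_size=rows_per_group)
{code}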
Now, my conclusions may be wrong and I may be blind to some line of reasoning,
so this ticket is more of a question than a bug report: a question of whether
the audience here agrees with my reasoning and, if not, what detail I have
missed.
> Denominate row group size in bytes (not in no of rows)
> ------------------------------------------------------
>
> Key: ARROW-4542
> URL: https://issues.apache.org/jira/browse/ARROW-4542
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Remek Zajac
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)