etseidl commented on code in PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#discussion_r3080951829


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -4827,6 +4895,48 @@ mod tests {
         assert_eq!(get_dict_page_size(col1_meta), 1024 * 1024 * 4);
     }
 
+    #[test]
+    fn test_dict_page_size_decided_by_compression_fallback() {

Review Comment:
   As a test, I saved the output from this test and examined the column sizes. 
Without the heuristic, the encoded size for col0 is 8658384 bytes (the default 
fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 
8391126 bytes, a savings of 3%. 
   
   I also modified the test to take the index modulo 32767. In that case, col1 
was still 8391126 bytes, but col0 was only 2231581 bytes, nearly 4X smaller.
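
   For reference, the percentages above can be re-derived from the quoted byte 
counts. This is only a sketch of the arithmetic; the helper names `savings` and 
`ratio` are made up for illustration and are not part of the PR or of arrow-rs.

```rust
/// Fractional size reduction when going from `before` bytes to `after` bytes.
fn savings(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64
}

/// How many times larger `large` is than `small`.
fn ratio(large: u64, small: u64) -> f64 {
    large as f64 / small as f64
}

fn main() {
    // Byte counts quoted in this comment.
    let col0_fallback = 8_658_384u64; // col0, no heuristic, original test
    let col1_heuristic = 8_391_126u64; // col1, heuristic (both runs)
    let col0_dict_32k = 2_231_581u64; // col0, index modulo 32767

    // Roughly 3.1%, i.e. the "3%" savings.
    println!(
        "heuristic savings: {:.1}%",
        100.0 * savings(col0_fallback, col1_heuristic)
    );
    // Roughly 3.76x, i.e. "nearly 4X smaller".
    println!(
        "dictionary advantage: {:.2}x",
        ratio(col1_heuristic, col0_dict_32k)
    );
}
```

   So in the low-cardinality run the heuristic gives up about 3.76x, compared 
with the roughly 3% it gained in the original test.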
   
   I know this is not entirely representative, but it does again point out the 
pitfalls of too simplistic an approach.
   
   Edit: I did a test of Spark with the latter file (32k cardinality). By 
default, it opts to fall back for all pages, so the file is even larger. If I 
set the global `parquet.page.row.count.limit` to 132000, it then opts for 
dictionary encoding, as it should.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
