sidred opened a new issue, #3530:
URL: https://github.com/apache/arrow-rs/issues/3530

   **Describe the bug**
   The row group `total_byte_size` currently written to the parquet file is the compressed size, not the uncompressed size as expected.
   
   **To Reproduce**
   For example, `uk_cities_with_headers.csv` converted to parquet with the schema
   ```
   message uk_cities {
     required binary city (STRING);
     required double lat;
     required double lng;
   }
   ```
   shows the following stats
   ```
   $ java -jar parquet-tools-1.10.99.7.2.15.0-147.jar meta uk_cities_rust.parquet
   ....
   row group 1: RC:37 TS:1595 OFFSET:4
   --------------------------------------------------------------------------------
   city:         BINARY SNAPPY DO:4 FPO:710 SZ:815/1115/1.37 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: Aberdeen, Aberdeen City, UK, max: Worthing, West Sussex, UK, num_nulls not defined]
   lat:          DOUBLE SNAPPY DO:907 FPO:1224 SZ:390/383/0.98 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: 50.376289, max: 57.653484, num_nulls not defined]
   lng:          DOUBLE SNAPPY DO:1349 FPO:1666 SZ:390/383/0.98 VC:37 ENC:PLAIN,RLE_DICTIONARY,RLE ST:[min: -7.318268, max: 0.573453, num_nulls not defined]
   ```
   
   **Expected behavior**
   The total size is expected to be the sum of the uncompressed column sizes (1115 + 383 + 383 = 1881), not the sum of the compressed sizes (815 + 390 + 390 = 1595).
   
   **Additional context**
   The same CSV converted to parquet using Python's pyarrow shows
   ```
   java -jar parquet-tools-1.10.99.7.2.15.0-147.jar meta uk_cities_py.parquet
   row group 1: RC:37 TS:1945 OFFSET:4
   --------------------------------------------------------------------------------
   city:         BINARY SNAPPY DO:4 FPO:713 SZ:826/1123/1.36 VC:37 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: Aberdeen, Aberdeen City, UK, max: Worthing, West Sussex, UK, num_nulls: 0]
   lat:          DOUBLE SNAPPY DO:941 FPO:1258 SZ:418/411/0.98 VC:37 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: 50.376289, max: 57.653484, num_nulls: 0]
   lng:          DOUBLE SNAPPY DO:1454 FPO:1771 SZ:418/411/0.98 VC:37 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -7.318268, max: 0.573453, num_nulls: 0]
   ```
   Here the total size matches the sum of the columns' uncompressed sizes:
   1945 = 1123 + 411 + 411
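
The comparison above can be replayed as a small check. This is only a sketch of the arithmetic, not code from the parquet crate: the names `ChunkSize`, `Matches`, `classify`, `rust_written_chunks` and `pyarrow_written_chunks` are hypothetical, and the numbers are copied by hand from the `SZ:compressed/uncompressed/ratio` fields in the two parquet-tools listings.

```rust
/// Byte sizes of one column chunk, taken from the parquet-tools
/// `SZ:compressed/uncompressed/ratio` field. Hypothetical helper type.
#[derive(Debug, Clone, Copy)]
pub struct ChunkSize {
    pub compressed: i64,
    pub uncompressed: i64,
}

/// Which per-column total a reported row group size (TS) agrees with.
#[derive(Debug, PartialEq, Eq)]
pub enum Matches {
    Uncompressed,
    Compressed,
    Neither,
}

/// Sum of the uncompressed sizes of all column chunks.
pub fn total_uncompressed(chunks: &[ChunkSize]) -> i64 {
    chunks.iter().map(|c| c.uncompressed).sum()
}

/// Sum of the compressed sizes of all column chunks.
pub fn total_compressed(chunks: &[ChunkSize]) -> i64 {
    chunks.iter().map(|c| c.compressed).sum()
}

/// Classify a reported TS value. The uncompressed total is checked first,
/// so if both totals happen to be equal the result is `Uncompressed`.
pub fn classify(reported_ts: i64, chunks: &[ChunkSize]) -> Matches {
    if reported_ts == total_uncompressed(chunks) {
        Matches::Uncompressed
    } else if reported_ts == total_compressed(chunks) {
        Matches::Compressed
    } else {
        Matches::Neither
    }
}

/// city, lat, lng as listed for uk_cities_rust.parquet (TS:1595).
pub fn rust_written_chunks() -> [ChunkSize; 3] {
    [
        ChunkSize { compressed: 815, uncompressed: 1115 },
        ChunkSize { compressed: 390, uncompressed: 383 },
        ChunkSize { compressed: 390, uncompressed: 383 },
    ]
}

/// city, lat, lng as listed for uk_cities_py.parquet (TS:1945).
pub fn pyarrow_written_chunks() -> [ChunkSize; 3] {
    [
        ChunkSize { compressed: 826, uncompressed: 1123 },
        ChunkSize { compressed: 418, uncompressed: 411 },
        ChunkSize { compressed: 418, uncompressed: 411 },
    ]
}
```

With these inputs, `classify(1595, &rust_written_chunks())` yields `Matches::Compressed`, while `classify(1945, &pyarrow_written_chunks())` yields `Matches::Uncompressed`, which is the discrepancy reported here.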

