[GitHub] [arrow-rs] alamb edited a comment on issue #641: Incorrect min/max statistics for strings in parquet files

GitBox Sat, 31 Jul 2021 03:20:13 -0700


alamb edited a comment on issue #641:
URL: https://github.com/apache/arrow-rs/issues/641#issuecomment-890325580



   I have confirmed that the Python parquet writer correctly stores 
`"tewsbury"` (spelled as in the test data) as the max in statistics.
   
   Using this python script:
   ```python
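   import pyarrow  # not used directly; ensures the pyarrow engine is available to pandas
   import pandas as pd
   
   # Six city names; "tewsbury" sorts last, so it should be the max statistic
   data = [
       "andover",
       "reading",
       "bedford",
       "tewsbury",
       "lexington",
       "lawrence",
   ]
   
   df = pd.DataFrame(data, columns=['city'])
   df.to_parquet('/tmp/test_python.parquet')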
   ```
   
   `parquet-tools` then confirms the min/max are "andover"/"tewsbury" as 
expected:
   
   ```shell
   alamb@ip-192-168-0-133 /tmp % parquet-tools dump /tmp/test_python.parquet 
   parquet-tools dump /tmp/test_python.parquet 
   row group 0 
   
----------------------------------------------------------------------------------------------------------------------
   city:  BINARY SNAPPY DO:4 FPO:90 SZ:139/137/0.99 VC:6 
ENC:RLE,PLAIN,PLAIN_DICTIONARY ST:[min: andover, max:  [more]...
   
       city TV=6 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
       
------------------------------------------------------------------------------------------------------------------
       page 0:                  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY ST:[min: 
andover, max: tewsbury, num_nulls: 0] [more]... VC:6
   
   BINARY city 
   
----------------------------------------------------------------------------------------------------------------------
   *** row group 1 of 1, values 1 to 6 *** 
   value 1: R:0 D:1 V:andover
   value 2: R:0 D:1 V:reading
   value 3: R:0 D:1 V:bedford
   value 4: R:0 D:1 V:tewsbury
   value 5: R:0 D:1 V:lexington
   value 6: R:0 D:1 V:lawrence
   alamb@ip-192-168-0-133 /tmp % 
   ```
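
The expected values can also be cross-checked without any Parquet writer. The following is a minimal sketch, not part of the original script: the helper name `expected_min_max` is hypothetical, and it assumes that string statistics are ordered by unsigned byte-wise comparison of the UTF-8 encoding, which is how Python compares `bytes` objects.

```python
# Same test data as the script above.
data = [
    "andover",
    "reading",
    "bedford",
    "tewsbury",
    "lexington",
    "lawrence",
]


def expected_min_max(values):
    """Return (min, max) of the strings, ordered by their UTF-8 bytes.

    Assumption: this mirrors the ordering a writer should use when it
    computes min/max statistics for a string column.
    """
    encoded = [v.encode("utf-8") for v in values]
    return min(encoded).decode("utf-8"), max(encoded).decode("utf-8")


expected = expected_min_max(data)
print(expected)
```

For this data the result should match what `parquet-tools` reports for page 0, so a writer that stores a different max is not using this ordering.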
   


