tustvold opened a new issue #1976:
URL: https://github.com/apache/arrow-datafusion/issues/1976


   **Describe the bug**
   
   The parquet SQL benchmarks no longer run cleanly. In particular, the 
following query returns an error:
   
   ```
   select string_optional from t where dict_10_required = 'prefix#1' and 
dict_1000_required = 'prefix#1';
   ```
   
   ```
    Parquet argument error: Parquet error: 'block_size' must be a multiple of 
128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: 
SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, 
last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]
   ```
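
For context, the `block_size` message appears to come from header validation for Parquet's DELTA_BINARY_PACKED encoding, whose format spec requires the block size to be a multiple of 128. The following is a minimal Python sketch of that header check, written from the Parquet format spec and not taken from arrow-rs; the function names are hypothetical. It illustrates how a decoder that starts reading at the wrong offset, or is handed a page in some other encoding, could report an arbitrary value such as 90.

```python
from typing import Dict, Tuple


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("EOF while decoding varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map an unsigned zigzag-encoded integer back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def parse_delta_header(page: bytes) -> Dict[str, int]:
    """Parse a DELTA_BINARY_PACKED header (hypothetical helper, per the spec).

    Header layout: <block size> <miniblocks per block> <total value count>
    <first value>, the first three as ULEB128 and the last as zigzag ULEB128.
    """
    block_size, pos = read_uleb128(page, 0)
    miniblocks, pos = read_uleb128(page, pos)
    total_values, pos = read_uleb128(page, pos)
    first_raw, pos = read_uleb128(page, pos)

    # The spec requires the block size to be a multiple of 128.
    if block_size % 128 != 0:
        raise ValueError(
            f"'block_size' must be a multiple of 128, got {block_size}"
        )
    # Each miniblock must hold a multiple of 32 values.
    if miniblocks == 0 or (block_size // miniblocks) % 32 != 0:
        raise ValueError("values per miniblock must be a multiple of 32")

    return {
        "block_size": block_size,
        "miniblocks_per_block": miniblocks,
        "total_values": total_values,
        "first_value": zigzag_decode(first_raw),
        "header_len": pos,
    }


if __name__ == "__main__":
    # A well-formed header: block size 128, 4 miniblocks, 10 values, first value 7.
    print(parse_delta_header(bytes([0x80, 0x01, 0x04, 0x0A, 0x0E])))
    # Bytes that are not a delta header: the first varint decodes to 90.
    try:
        parse_delta_header(bytes([0x5A, 0x04, 0x0A, 0x0E]))
    except ValueError as err:
        print(err)
```

The sketch only shows that the check fires on whatever bytes sit at the start of the page. It does not tell us whether those bytes were written incorrectly or are being read with the wrong decoder.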
   
   I suspected this was related to https://github.com/apache/arrow-rs/pull/1284, 
which was included in the 9.1 release of arrow, but rolling back to before this 
upgrade just changes the error message:
   
   ```
   Parquet argument error: EOF: eof decoding byte array") for files: 
[PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: 
"/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: 
Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]
   ```
   
   It is unclear at this stage whether the encoder is writing gibberish or 
whether a bug has been introduced in the decoder. Either way, if this is an 
upstream bug, we should have caught it in arrow-rs.
   
   Unfortunately, my go-to approach of trying alternative tools has not yet 
yielded fruit. I guess I need to go work out how to get Spark running...
   
   ```
   >>> pq.read_table('/home/raphael/Downloads/borked.parquet', 
columns=['string_optional'])
   OSError: Not yet implemented: Unsupported encoding.
   
   >>> duckdb.query(f"select string_optional from 
'/home/raphael/Downloads/borked.parquet'").fetchall()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   RuntimeError: Unsupported page encoding
   ```
   
   **To Reproduce**
   
   Run the SQL benchmarks
   
   **Expected behavior**
   
   They run without errors
   
   **Additional context**
   
   There is a broader question of whether we should be running this benchmark 
suite as part of a nightly CI job or similar. This potentially relates to #1377.
   



