suremarc commented on issue #14281:
URL: https://github.com/apache/datafusion/issues/14281#issuecomment-2613757948

   > I've heard from another user that they managed to work around this by 
switching off the page index when generating the files by `parquet-go`. 
However, when I tried this, I still ran into this problem
   
   Hey, I'm the one who mentioned this on Discord 😄 What I meant is that we 
disabled `datafusion.execution.parquet.enable_page_index` at query time, not 
that we skipped writing the page index when generating the Parquet file.
   
   In your [repro repository](https://github.com/senyosimpson/fusion-repro) I 
was able to confirm that disabling the page index makes the query work:
   
   ```sql
   ❯ datafusion-cli
   DataFusion CLI v43.0.0
   > select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
   External error: Parquet error: External: bad data
   
   > SET datafusion.execution.parquet.enable_page_index = false;
   0 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
   
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | city   | country | age | scale | status | time_captured            | checked |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   | Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
   +--------+---------+-----+-------+--------+--------------------------+---------+
   1 row(s) fetched. 
   Elapsed 0.021 seconds.
   ```
   
   Last time I looked at this issue, I had a feeling that it was a problem with 
`parquet-go`'s Thrift implementation, but I wasn't able to find evidence, or tbh 
I just don't remember, since it's been quite some time. It's possible that 
pyarrow and pandas work because they aren't using the page index for 
predicate pushdown.
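
To illustrate why turning the page index off could sidestep the error, here is a toy model in Python of page-level pruning for a predicate like `age > 10`. It is only a sketch under stated assumptions: `PageStats`, `pages_to_read`, and the length check are hypothetical names and logic invented for illustration, not DataFusion's or the Parquet reader's actual code, and the real defect in the `parquet-go` files has not been identified here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for one page's min/max entry in a column index."""
    min_value: int
    max_value: int


def pages_to_read(
    page_index: Optional[List[PageStats]], num_pages: int, threshold: int
) -> List[int]:
    """Return the page numbers a reader must decode for `value > threshold`."""
    if page_index is None:
        # Page index disabled or absent: nothing is pruned, every page is
        # decoded, and the index bytes are never parsed.
        return list(range(num_pages))
    if len(page_index) != num_pages:
        # Assumed failure mode, for illustration only: a reader that trusts
        # the index rejects the file when the index is inconsistent.
        raise ValueError("bad data: page index does not match column chunk")
    # Keep only pages whose maximum could satisfy the predicate.
    return [i for i, stats in enumerate(page_index) if stats.max_value > threshold]


good_index = [PageStats(1, 9), PageStats(12, 40)]
print(pages_to_read(good_index, 2, 10))  # [1]
print(pages_to_read(None, 2, 10))        # [0, 1]
```

In this model a reader that never consults the index cannot trip over a malformed one, which is consistent with the guess about pyarrow and pandas above; it just gives up the pruning.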


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

