suremarc commented on issue #14281: URL: https://github.com/apache/datafusion/issues/14281#issuecomment-2613757948
> I've heard from another user that they managed to work around this by switching off the page index when generating the files by `parquet-go`. However, when I tried this, I still ran into this problem

Hey, I'm the one who mentioned this in discord 😄 what I meant is that we disabled `datafusion.execution.parquet.enable_page_index` at query time, not that we skipped generating the page index when generating the parquet file.

In your [repro repository](https://github.com/senyosimpson/fusion-repro) I was able to confirm disabling the page index makes it work:

```sql
❯ datafusion-cli
DataFusion CLI v43.0.0
> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
External error: Parquet error: External: bad data
> SET datafusion.execution.parquet.enable_page_index = false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> select * from 'go-parquet-writer/go-testfile.parquet' where age > 10;
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
1 row(s) fetched.
Elapsed 0.021 seconds.
```

Last time I looked at this issue I had a feeling that this was an issue with `parquet-go`'s Thrift implementation but I wasn't able to find evidence, or tbh I also just don't remember since it's been quite some time...

It's possible that pyarrow and pandas work if they aren't utilizing the page index for predicate pushdown.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org