jorisvandenbossche commented on issue #37428: URL: https://github.com/apache/arrow/issues/37428#issuecomment-1700752478
I think that long term, ideally, reading through the new dataset API would also support duplicate column names. I assume it will be hard to support that throughout a full query in Acero, but at least supporting it in a Scan node, so that you can rename the columns afterwards, would be useful.

One thing to note is that even if we remove the `use_legacy_dataset=True` option in the near future, you can still use the single-file `pq.ParquetFile(..).read()` interface, which does support this. The difference in support is between the pure Parquet reader and the Parquet-format Dataset reader.

The unfortunate aspect from a user's point of view is that the most used function, `pq.read_table`, mixes both cases: because it has historically supported reading multiple files through the legacy `pq.ParquetDataset`, we updated `pq.read_table` to read using the new dataset API. But for some cases the single-file reader actually works better (duplicate column names are one such example, and selecting fields of nested columns is another).
