[GitHub] [arrow] jorisvandenbossche commented on issue #37428: [Python][Parquet] Cannot read parquet with duplicate column names

via GitHub Thu, 31 Aug 2023 03:10:01 -0700


jorisvandenbossche commented on issue #37428:
URL: https://github.com/apache/arrow/issues/37428#issuecomment-1700752478


   I think long term, ideally, the new dataset API based reading would also
support duplicate column names. I assume it will be hard to support that
throughout a full query in Acero, but at least supporting it in a Scan node, so
you can rename the columns afterwards, would be useful.
   
   One thing to note is that even if we remove the
`use_legacy_dataset=True` option in the near future, you can still use the
single-file `pq.ParquetFile(..).read()` interface, which does support this.
   The difference in support is between the pure Parquet reader and the
Parquet-format Dataset reader. The unfortunate aspect from a user point of view
is that the most used function, `pq.read_table`, mixes both cases: because it
has historically supported reading multiple files through the legacy
`pq.ParquetDataset`, we updated `pq.read_table` to read using the new dataset
API. But for some cases the single-file reader actually works better
(duplicate column names are one such example; selecting fields of nested
columns is another).


