GitHub user wgtmac commented on the issue:
https://github.com/apache/spark/pull/15035
@HyukjinKwon This is not Parquet-specific; it applies to other data sources
as well.
1. Changing the read path for Parquet: this does not solve the problem,
because some queries need to read all Parquet files.
2. Making changes per row: yes, I have to handle it per row because some
Parquet files have int while others have long, so we can't know in advance
which rows are fine and which are problematic.
3. Vectorized Parquet reader: this is a good catch. I haven't considered
it yet.
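
To illustrate point 2, here is a minimal sketch of per-value widening. It is
not the code in this PR and does not use Spark or Parquet APIs: the function
`read_long` and the sample lists are hypothetical, and `ctypes.c_int32` /
`ctypes.c_int64` only stand in for the int and long physical types found in
different files.

```python
import ctypes

# Hypothetical stand-ins for values decoded from two sets of Parquet files:
# older files stored the column as 32-bit int, newer ones as 64-bit long.
rows_from_int_files = [ctypes.c_int32(1), ctypes.c_int32(-2)]
rows_from_long_files = [ctypes.c_int64(2**40)]


def read_long(value):
    """Return the value of a column that the table schema declares as long.

    The check runs per value because the caller cannot tell in advance
    whether a row came from an int file or a long file.
    """
    if isinstance(value, ctypes.c_int64):
        return value.value  # already 64-bit, nothing to do
    if isinstance(value, ctypes.c_int32):
        # Widen int -> long; this is lossless for every 32-bit value.
        return ctypes.c_int64(value.value).value
    raise TypeError(f"expected int or long, got {type(value).__name__}")


# A query that scans all files sees both physical types in one result.
merged = [read_long(v) for v in rows_from_int_files + rows_from_long_files]
print(merged)
```

The type check sits inside the accessor rather than at the file level, which
mirrors the constraint described above: the reader only learns the stored
type when it looks at each value.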
It would be great if you could come up with other ideas and continue
working on it. Feedback and discussion are welcome. Thanks!