Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7210#issuecomment-123515998
@watermen Sorry for the late reply. I have two high-level comments here:
1. Usually Spark SQL converts metastore Parquet tables to its native
Parquet relations and handles them in a more efficient way, without going
through Hive SerDe support. In that case, we are actually using our own
customized input format class, and simply ignore the input format you added
here. This means `CombineParquetInputFormat` only takes effect when users
explicitly disable the metastore Parquet table conversion (see the first
sketch after this list).
2. Users who do need to deal with lots of small files have to set the
newly introduced configuration manually, which means they are aware of the
existence of those small files anyway. In that case, is it OK to just create
the tables with the combine input format directly (see the second sketch
after this list)? Does this work for you? The benefit is that you'd be using
a standard Hive feature to solve the problem. As for
`CombineParquetInputFormat` itself, it's actually a pretty general utility
that can be used outside the scope of Spark. I feel it's worth making it a
separate package (maybe a package on spark-packages?)
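
For the first point, here is a minimal sketch of what disabling the
conversion looks like from the user's side, using the existing
`spark.sql.hive.convertMetastoreParquet` flag; the table name
`small_files_table` is just a placeholder:

```scala
import org.apache.spark.sql.hive.HiveContext

// `sc` is an existing SparkContext. With the conversion flag turned off,
// Spark SQL falls back to Hive SerDe support, so a custom input format
// declared on the table (e.g. CombineParquetInputFormat) is actually
// honored instead of being bypassed by the native Parquet relation.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
hiveContext.sql("SELECT * FROM small_files_table").show()
```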
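
For the second point, a sketch of the standard Hive DDL route: declare the
combine input format in the table definition itself, so the small-file
handling lives with the table rather than in a Spark-specific configuration.
The table and column names are hypothetical, and the package of
`CombineParquetInputFormat` below is a placeholder; the SerDe and output
format classes are the stock Hive Parquet ones:

```scala
// Reuses the `hiveContext` from the previous sketch.
hiveContext.sql("""
  CREATE TABLE combined_events (id BIGINT, payload STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'com.example.CombineParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
""")
```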