Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7210#issuecomment-123515998
@watermen Sorry for the late reply. I have two high-level comments here:
1. Usually Spark SQL converts metastore Parquet tables to its native
Parquet relations and handles them in a more efficient way, without going
through Hive SerDe support. In that case, we are actually using our own
customized input format class, and simply ignore the input format you added
here. This means `CombineParquetInputFormat` only takes effect when users
explicitly disable the metastore Parquet table conversion (see the first
sketch after this list).
2. Users who do need to deal with lots of small files have to set the
newly introduced configuration manually, which means they are aware of the
existence of those small files anyway. In that case, is it OK to just create
the tables with the combine input format directly (see the second sketch
after this list)? Does this work for you? The benefit is that you'd be using
a standard Hive feature to solve the problem. As for
`CombineParquetInputFormat` itself, it's actually a pretty general utility
that can be used outside the scope of Spark. I feel it's worth making it a
separate package (maybe a package on spark-packages?)
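
For the first point, here is a minimal sketch of what disabling the
conversion looks like from the user's side, using the existing
`spark.sql.hive.convertMetastoreParquet` flag; the table name
`small_files_table` is just a placeholder:

```scala
import org.apache.spark.sql.hive.HiveContext

// `sc` is an existing SparkContext. With the conversion flag turned off,
// Spark SQL falls back to Hive SerDe support, so a custom input format
// declared on the table (e.g. CombineParquetInputFormat) is actually
// honored instead of being bypassed by the native Parquet relation.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
hiveContext.sql("SELECT * FROM small_files_table").show()
```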
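
For the second point, a sketch of the standard Hive DDL route: declare the
combine input format in the table definition itself, so the small-file
handling lives with the table rather than in a Spark-specific configuration.
The table and column names are hypothetical, and the package of
`CombineParquetInputFormat` below is a placeholder; the SerDe and output
format classes are the stock Hive Parquet ones:

```scala
// Reuses the `hiveContext` from the previous sketch.
hiveContext.sql("""
  CREATE TABLE combined_events (id BIGINT, payload STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'com.example.CombineParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
""")
```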