Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1919#issuecomment-52509183
  
    @baishuo Thank you for working on it.
    
    I have three general comments.
    
    1. Hive has a lot of confs that influence how the semantic analyzer 
works. `HiveConf.ConfVars.DYNAMICPARTITIONING (hive.exec.dynamic.partition)` 
and `HiveConf.ConfVars.DYNAMICPARTITIONINGMODE 
(hive.exec.dynamic.partition.mode)` are two examples. As long as we generate 
correct results and can make sure the execution is robust, I think it is not 
necessary to follow those confs (users who want Hive's behavior can still set 
them; see the snippet below).
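
    For reference, a minimal sketch of setting those two confs through a 
Spark 1.x `HiveContext` (the conf names are real Hive settings; the session 
setup and the `events`/`staging` tables are made up for illustration):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dynamic-partition-demo"))
    val hiveContext = new HiveContext(sc)

    // Enable dynamic partitioning and allow every partition column to be dynamic.
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // An insert where the `country` partition value is resolved at runtime.
    hiveContext.sql(
      "INSERT OVERWRITE TABLE events PARTITION (country) SELECT msg, country FROM staging")
    ```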
    
    2. For `hive.exec.dynamic.partition.mode`, I think its purpose is to 
avoid having too many concurrent file writers in a task. Actually, even with 
`hive.exec.dynamic.partition.mode=strict`, we can still have many distinct 
values in the dynamic partitioning columns and thus too many file writers in 
a task. For columnar file formats like RCFile, ORC, and Parquet, every file 
writer internally maintains a memory buffer, so many concurrent writers can 
significantly increase the memory footprint of a task and introduce OOMs. 
Instead of relying on Hive's confs, it is better to provide a way to group 
data based on the dynamic partitioning columns, so we do not end up with many 
concurrent file writers. Here are two primitive ideas: we can shuffle the 
data before inserting, or we can do local grouping and write the data in a 
group-by-group fashion (see the sketch below). Either way, I feel we may need 
to introduce changes to the planner.
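
    A minimal sketch of the local-grouping idea, assuming the rows reaching a 
task are already sorted by the dynamic partition value (e.g. after a shuffle 
plus sort): write group-by-group so that at most one writer is open at a 
time. `Writer` and `openWriter` are hypothetical stand-ins, not Spark or 
Hive APIs:

    ```scala
    // Hypothetical writer abstraction; in reality this would wrap a Hive
    // SerDe/OutputFormat writer for one partition directory.
    trait Writer[T] {
      def write(row: T): Unit
      def close(): Unit
    }

    // `sortedRows` pairs each row with its dynamic partition value and must be
    // sorted by that value, so each distinct value forms one contiguous group.
    def writeSortedGroups[T](
        sortedRows: Iterator[(String, T)],
        openWriter: String => Writer[T]): Unit = {
      var currentKey: String = null
      var writer: Writer[T] = null
      for ((key, row) <- sortedRows) {
        if (writer == null || key != currentKey) {
          if (writer != null) writer.close() // group boundary: free the old buffer
          writer = openWriter(key)           // open the writer for the new group
          currentKey = key
        }
        writer.write(row)
      }
      if (writer != null) writer.close()     // close the last group's writer
    }
    ```

    This keeps the peak number of open writers per task at one, at the cost 
of a sort, which is the trade-off a planner change would have to make 
explicit.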
    
    3. The last comment is not quite related to this PR. I think it is better 
to have a general design for how a table is partitioned, such that 
(hopefully) Hive's directory layout in HDFS will be just a special case. I am 
not sure that creating a single file for every combination of values of the 
partitioning columns is a good way. It introduces potential stability issues 
to the insert operation (too many file writers), and performance issues to 
both insert and table scan operations. With this approach, we can easily 
create a lot of small files in HDFS, which puts memory pressure on the HDFS 
NameNode.
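
    For concreteness, Hive's layout nests one directory level per partition 
column, with one or more files per leaf directory (the table name and values 
below are made up):

    ```
    /user/hive/warehouse/events/ds=2014-08-18/country=US/part-00000
    /user/hive/warehouse/events/ds=2014-08-18/country=CA/part-00000
    /user/hive/warehouse/events/ds=2014-08-19/country=US/part-00000
    ```

    One file per combination of partition values per writing task; with many 
combinations, this quickly multiplies into many small files.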

