Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/1919#issuecomment-52509183
@baishuo Thank you for working on it.
I have three general comments.
1. Hive has a lot of confs that influence how the semantic analyzer works. `HiveConf.ConfVars.DYNAMICPARTITIONING (hive.exec.dynamic.partition)` and `HiveConf.ConfVars.DYNAMICPARTITIONINGMODE (hive.exec.dynamic.partition.mode)` are two examples (the first sketch after these comments shows how they are typically set). As long as we generate correct results and can make sure the execution is robust, I think it is not necessary to follow those confs.
2. For `hive.exec.dynamic.partition.mode`, I think its purpose is to avoid having too many concurrent file writers in a task. Actually, even if `hive.exec.dynamic.partition.mode=strict`, we can still have many distinct values in the dynamic partitioning columns and thus too many file writers in a task. For columnar file formats like RCFile, ORC, and Parquet, every file writer internally maintains a memory buffer, so many concurrent file writers can significantly increase the memory footprint of a task and cause OOMs. Instead of relying on Hive's confs, it is better to provide a way to group the data based on the dynamic partitioning columns, so we will not have many concurrent file writers. Two preliminary ideas: we can shuffle the data before inserting, or we can do local grouping and write the data in a group-by-group fashion (the second sketch after these comments illustrates this). Either way, I feel we may need to introduce changes to the planner.
3. The last comment is not directly related to this PR. I think it is better to have a general design for how a table is partitioned, such that (hopefully) Hive's directory layout in HDFS will be just a special case. I am not sure that creating a single file for every combination of values of the partitioning columns is a good approach. It introduces potential stability issues for the insert operation (too many file writers) and performance issues for both insert and table scan operations. With this approach, we can easily create a lot of small files in HDFS, which puts memory pressure on the HDFS NameNode.
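
A minimal sketch of how the two confs from comment 1 are typically toggled before a dynamic-partition insert, assuming a Spark 1.x `HiveContext` and an existing `SparkContext` named `sc`; the table and column names are made up for illustration, and none of this is code from this PR:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Enable dynamic partitioning and relax the strict-mode requirement that at
// least one partition column be given a static value.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// The value of `dt` is resolved per row from the SELECT output.
hiveContext.sql(
  "INSERT OVERWRITE TABLE target PARTITION (dt) SELECT value, dt FROM source")
```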
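
And a rough sketch of the local-grouping idea from comment 2, again not code from this PR; `Row`, `Writer`, and `openWriter` are hypothetical stand-ins. Sorting a task's rows by the dynamic partition key makes rows of the same partition contiguous, so at most one file writer is open at a time instead of one per distinct key:

```scala
case class Row(partitionKey: String, value: String)

trait Writer {
  def write(row: Row): Unit
  def close(): Unit
}

// Stand-in for opening a real file writer for one partition directory.
def openWriter(key: String): Writer = new Writer {
  def write(row: Row): Unit = println(s"partition=$key value=${row.value}")
  def close(): Unit = ()
}

def writeGroupByGroup(rows: Seq[Row]): Unit = {
  var current: Option[(String, Writer)] = None
  // After the local sort, rows with the same key are adjacent, so a new
  // writer is opened only when the key changes.
  rows.sortBy(_.partitionKey).foreach { row =>
    current match {
      case Some((key, writer)) if key == row.partitionKey =>
        writer.write(row)
      case other =>
        other.foreach { case (_, writer) => writer.close() }
        val writer = openWriter(row.partitionKey)
        writer.write(row)
        current = Some((row.partitionKey, writer))
    }
  }
  current.foreach { case (_, writer) => writer.close() }
}
```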