[
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903708#action_12903708
]
Ning Zhang commented on HIVE-1602:
----------------------------------
I agree this will be a big change, and we are just tossing ideas around here.
We don't have a final plan yet.
HAR is one idea, and we should definitely try it once HIVE-1467 is done. But as
you said, it won't change the number of partitions. Check out some of our
tables, which have more than 240 partitions each day. With dynamic partitioning,
it is very easy to increase that number even more.
Another idea Namit and I were talking about is to map the list of values
{'s', 'm', 'l'} to the actual partition location and store this mapping in the
metastore. This essentially separates the logical concept of a partition from
the physical storage location (HDFS directories). This could be a big change
and break the assumptions of users who rely on the reverse mapping (figuring
out the partition from the HDFS directory).
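To make the mapping idea a bit more concrete, here is a rough sketch in Python.
It is only an illustration: ListPartitionMap, its methods, and the warehouse
paths are hypothetical names, not anything that exists in Hive or the metastore.

```python
# Hypothetical sketch: a metastore-side map from partition column values
# to a physical location. Several values may share one directory.
class ListPartitionMap:
    def __init__(self, default_location):
        # Values that were never listed fall back to this directory (an assumption).
        self.default_location = default_location
        self._value_to_location = {}

    def add_list(self, values, location):
        """Register a list of partition values that share one directory."""
        for value in values:
            self._value_to_location[value] = location

    def location_for(self, value):
        """Forward mapping: partition value -> directory."""
        return self._value_to_location.get(value, self.default_location)

    def values_at(self, location):
        """Reverse mapping: directory -> all values stored there."""
        return sorted(
            v for v, loc in self._value_to_location.items() if loc == location
        )


pmap = ListPartitionMap(default_location="/warehouse/t/other")
pmap.add_list(["s", "m", "l"], "/warehouse/t/list_0")
pmap.add_list(["xl"], "/warehouse/t/list_1")
```

Note that values_at() returns a list: once several values share a directory,
the directory name alone no longer identifies a single partition value, which
is exactly the assumption that would break for some users.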
If we decide to go this route, inserting is easy: we get the mapping from the
metastore and decide which directory to write to for a given output row.
Querying is a little more complicated, as the partition pruning phase needs to
figure out which physical directory a partition corresponds to and get the
partition column value from the data file itself rather than from the directory
name. The overhead, of course, is that the partition column value needs extra
storage in the data file. But if we sort on the partitioning column and use
RCFile with column-level run-length compression (which we already support), the
storage overhead is very small.
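The insert and query paths described above could look roughly like the sketch
below. Everything here is an assumption for illustration: the VALUE_TO_DIR
table, the "size" column, and the function names are made up, and
run_length_encode() is a plain run-length encoder, not RCFile's actual
on-disk format.

```python
from itertools import groupby

# Hypothetical value -> directory mapping, as it might be read from the metastore.
VALUE_TO_DIR = {
    "s": "/warehouse/t/list_0",
    "m": "/warehouse/t/list_0",
    "l": "/warehouse/t/list_0",
    "xl": "/warehouse/t/list_1",
}


def route_row(row, part_col="size"):
    """Insert side: pick the output directory for one row."""
    return VALUE_TO_DIR[row[part_col]]


def prune(wanted_values):
    """Query side: directories that may hold any of the wanted values."""
    return sorted({VALUE_TO_DIR[v] for v in wanted_values if v in VALUE_TO_DIR})


def scan(rows, wanted_values, part_col="size"):
    """After pruning, the partition value is read from the row, not the path."""
    return [r for r in rows if r[part_col] in wanted_values]


def run_length_encode(column):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]
```

With the data sorted on the partitioning column, each distinct value becomes a
single (value, count) pair, so the extra column costs storage proportional to
the number of distinct values in the file rather than the number of rows.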
> List Partitioning
> -----------------
>
> Key: HIVE-1602
> URL: https://issues.apache.org/jira/browse/HIVE-1602
> Project: Hadoop Hive
> Issue Type: New Feature
> Affects Versions: 0.7.0
> Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition
> column values. Currently, one partition is created for each distinct DP column
> value. This can result in skew among the created dynamic partitions: some
> partitions are large, but there can also be a large number of small
> partitions. This puts a burden on HDFS as well as the metastore. A list
> partitioning scheme that aggregates a number of small partitions into one big
> one is preferable for skewed partitions.
--
This message is automatically generated by JIRA.