[
https://issues.apache.org/jira/browse/HIVE-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267266#comment-15267266
]
Sushanth Sowmyan commented on HIVE-13652:
-----------------------------------------
Adding some general background info for anyone who wishes to work on this:
(Note: this is not specific to Hive Export/Import, but concerns hive
managed table partition creation in general, and the problem is that there
isn't a "good" solution to this that won't rub someone the wrong way)
Given that the source can be any arbitrary table, even one created by a user
outside of hive, deciding what "order" to retain is tricky, and it can be
difficult even to know what "order" was used. This is because the source can
have a partition year=2012, hour=18, and yet have a directory that looks like
any of the following:
{noformat}
/apps/hive/warehouse/weblogs/year=2012/hour=18
/apps/hive/warehouse/weblogs/2012/18
/apps/hive/warehouse/weblogs/201218/
/apps/hive/warehouse/weblogs/frank/
{noformat}
Thus, we do not store the correlation between partition key-values in source
and destination, and the only thing we "know" is that a partition with a set of
key-value pairs is associated with some data that we read. So, in the
destination, we ignore whatever the source said about the dir name, recreate
the partition based only on key-value pair info, and let hive's default loading
mechanism pick the location for us.
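As a rough illustration of "recreate a partition based only on key-value pair info", here is a minimal sketch of how a default location could be derived from a partition spec. The class and method names are hypothetical and are not Hive's actual API; real Hive also escapes special characters in names and values, which this sketch omits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class PartitionPathSketch {
    // Hypothetical helper (not a Hive method): appends key=value directory
    // components to the table directory, in the map's iteration order.
    static String defaultPartitionPath(String tableDir, Map<String, String> partSpec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableDir);
        for (Map.Entry<String, String> entry : partSpec.entrySet()) {
            path.add(entry.getKey() + "=" + entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the result is predictable here.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("year", "2012");
        spec.put("hour", "18");
        System.out.println(defaultPartitionPath("/apps/hive/warehouse/weblogs", spec));
    }
}
```

Note that the resulting directory depends entirely on the order in which the map hands back its entries, which is the crux of the issue.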
==
The underlying problem here is this: currently, the list of key-values is
stored as a HashMap, which is not ordered, and thus its iteration order is not
guaranteed to be identical across JDKs or OSes. This doesn't currently affect
us, however, since it's only relevant at the time a partition is created, and
as long as the metadata consistently points to the correct location for the
data, hive doesn't care.
Since we don't force an order, the order is whatever iteration order that
HashMap happens to produce for those values, on that JDK version + OS. This
means that as long as you don't change JDK version + OS + the key-values, it is
repeatably consistent. Change even one of those, however, and you could easily
wind up with this differing. This can even happen within Hive, where we've done
"ALTER TABLE ADD PARTITION" for a while on a cluster, upgrade the JDK, and then
do another "ALTER TABLE ADD PARTITION", and it picks dd/mm instead of the mm/dd
that it has been using for a while. Or if one machine was on ubuntu and the
other on centos/etc.
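To make the ordering point concrete, the following sketch (hypothetical class and method names, not Hive code) builds the same partition spec in a HashMap and in a LinkedHashMap. Only the LinkedHashMap result is guaranteed by the Java API to follow insertion order; the HashMap result is whatever the running JDK happens to produce.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class MapOrderSketch {
    // Joins the entries as key=value path components in iteration order.
    static String join(Map<String, String> spec) {
        return spec.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("/"));
    }

    // Inserts mm first, then dd, into whichever map implementation is passed in.
    static Map<String, String> fill(Map<String, String> target) {
        target.put("mm", "04");
        target.put("dd", "23");
        return target;
    }

    public static void main(String[] args) {
        // Unspecified order: may be mm/dd or dd/mm depending on the JDK's hashing.
        System.out.println("HashMap:       " + join(fill(new HashMap<>())));
        // Insertion order is part of the LinkedHashMap contract.
        System.out.println("LinkedHashMap: " + join(fill(new LinkedHashMap<>())));
    }
}
```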
Some possible solutions:
a) We can force order of key-values by order of key occurrence in the metastore
for all "new" partitions ever created in hive. The problem with this is that it
might force additional metastore calls to determine this order (adding load).
b) We can force alphabetical order of key-values for all "new" partitions ever
created in hive. The problem with this is that we now get into a notion of what
is alphabetical order in what codepage (although that can still be
deterministic). It's also possible that going alphabetical will cause a pretty
"dumb" ordering, where "dumb" in this case can mean (i) non-intuitive: say
day=23/market_id=45/month=4/year=2016, or (ii) bad in terms of skew, having a
higher frequency partition separation be a parent of a lower freq one,
resulting in a much larger number of dirs created.
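The two options above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the list of partition keys stands in for whatever the metastore would return for the table, and "alphabetical" here means Java's natural String ordering.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

final class ForcedOrderSketch {
    // Option (a): follow the order in which the table declares its partition keys.
    static String byDeclaredOrder(List<String> partitionKeys, Map<String, String> spec) {
        StringJoiner path = new StringJoiner("/");
        for (String key : partitionKeys) {
            String value = spec.get(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for partition key: " + key);
            }
            path.add(key + "=" + value);
        }
        return path.toString();
    }

    // Option (b): sort keys by natural String order via a TreeMap.
    static String byAlphabeticalOrder(Map<String, String> spec) {
        StringJoiner path = new StringJoiner("/");
        for (Map.Entry<String, String> entry : new TreeMap<>(spec).entrySet()) {
            path.add(entry.getKey() + "=" + entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("year", "2016");
        spec.put("month", "4");
        spec.put("day", "23");
        spec.put("market_id", "45");

        List<String> declared = Arrays.asList("year", "month", "day", "market_id");
        System.out.println(byDeclaredOrder(declared, spec));
        // prints year=2016/month=4/day=23/market_id=45
        System.out.println(byAlphabeticalOrder(spec));
        // prints day=23/market_id=45/month=4/year=2016
    }
}
```

Both variants are deterministic regardless of the input map's iteration order, but the second reproduces exactly the non-intuitive layout described in (b).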
Neither of these solves the original issue of export/import, because all we
wind up doing here is forcing order going forward, and not making sure to
"retain" whatever existed. Also, if a JDK/OS combination resulted in a
different default for two different users for similar schema, then by
"standardizing" it going forward, we break convention for one of them, either
way.
Even in the cases where export/import has currently been flipping a mm/dd/yyyy
into a dd/mm/yyyy, for example, if we standardize to fix it to retain original
order, we make it weird for a bunch of users that have had a mm/dd/yyyy in
place, and don't care about the order as long as it is consistent across the
table (a goal I'd argue they shouldn't have/care about, but nevertheless one
that might exist).
Other solutions that are possible:
a) Let a table specify that it cares about its default partition-naming-scheme,
similar to what hcat.dynamic.partitioning.custom.pattern does for HCat. The
problem with this is it can introduce complexity to a warehouse if people use
this feature extensively, i.e. it actually does nothing for the data and perf
in hive, it's simply for usability with external tools, and we run into a
too-many-configs-why-was-this-feature-even-here scenario, but maybe we can
ignore that.
b) Change export/import to honour existing order in the case of managed tables
(but ignore order or customization for external tables, because we truly cannot
determine what patterns might be used for external tables). This does not
help existing export/import cases, and can decide on a different norm for a
bunch of users, but does help a little going ahead.
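For option (a) in this second list, a table-level naming scheme could in principle be a template that is filled in from the partition key-values. The sketch below is purely illustrative: the class, the method, and the `${key}` placeholder syntax are assumptions made for the example, not a description of how Hive or HCat implements such patterns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamingPatternSketch {
    // Assumed placeholder syntax for this sketch: ${partition_key}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${key} in the pattern with the value from the partition spec.
    static String expand(String pattern, Map<String, String> spec) {
        Matcher matcher = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            String value = spec.get(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for partition key: " + key);
            }
            // quoteReplacement guards against '$' or '\' inside partition values.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("day", "23");
        spec.put("month", "4");
        spec.put("year", "2016");
        // The pattern, not the map, decides the directory order.
        System.out.println(expand("${year}/${month}/${day}", spec));
    }
}
```

Because the order comes from the pattern string, the result no longer depends on how the key-values happen to be stored, at the cost of one more per-table setting to maintain.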
Sorry for the longer than intended ramble, but this problem has been known
about for a while and wasn't fixed because of these trade-offs, and I wanted to
provide context.
> Import table change order of dynamic partitions
> -----------------------------------------------
>
> Key: HIVE-13652
> URL: https://issues.apache.org/jira/browse/HIVE-13652
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.2.1
> Reporter: Lukas Waldmann
>
> Table with multiple dynamic partitions like year, month, day exported using
> "export table" command is imported (using "import table") in such a way that
> order of partitions is changed to day, month, year.
> Export DB: Hive 0.14
> Import DB: Hive 1.2.1000.2.4.0.0-169
> Tables created as:
> create table T1
> ( ... ) PARTITIONED BY (period_year string, period_month string, period_day
> string) STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");
> export command:
> export table t1 to 'path'
> import command:
> import table t1 from 'path'
> HDFS file structure on both original table location and export path keeps the
> original partition order ../year/month/day
> HDFS file structure after import is .../day/month/year