[jira] [Commented] (HIVE-13652) Import table change order of dynamic partitions

Sushanth Sowmyan (JIRA) Mon, 02 May 2016 11:59:40 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267266#comment-15267266
 ]


Sushanth Sowmyan commented on HIVE-13652:
-----------------------------------------

Adding some general background info for anyone who wishes to work on this:

(Note: this is not specifically about Hive Export/Import, but about partition 
creation for Hive managed tables in general. The problem is that there isn't a 
"good" solution to this that won't rub someone the wrong way.)

Given that the source can be any arbitrary table, even one created by a user 
outside of Hive, deciding what "order" to retain is tricky, and it can be 
difficult even to know what "order" was used. This is because the source can 
have a partition year=2012, hour=18, and yet have a directory that looks like 
any of the following:

{noformat}
/apps/hive/warehouse/weblogs/year=2012/hour=18
/apps/hive/warehouse/weblogs/2012/18
/apps/hive/warehouse/weblogs/201218/
/apps/hive/warehouse/weblogs/frank/
{noformat}

Thus, we do not carry over any correlation between partition key-values and 
directory names from the source to the destination; the only thing we "know" is 
that a partition with a given set of key-value pairs is associated with some 
data that we read. In the destination, we therefore ignore whatever the source 
said about the dir name, recreate the partition based only on the key-value 
pair info, and let Hive's default loading mechanism pick the location for us.
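
To make the behaviour above concrete, here is a minimal Java sketch, not Hive's 
actual code. The class and method names are hypothetical, and escaping of 
special characters in keys and values is deliberately left out. It shows that 
the destination location depends only on the key-value pairs, whatever the 
source directory was called:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class DefaultPartitionLocation {
    // Hypothetical helper: builds "k1=v1/k2=v2" from the key-value pairs,
    // in the map's iteration order. Escaping is omitted in this sketch.
    static String relativePath(Map<String, String> partSpec) {
        StringJoiner path = new StringJoiner("/");
        for (Map.Entry<String, String> kv : partSpec.entrySet()) {
            path.add(kv.getKey() + "=" + kv.getValue());
        }
        return path.toString();
    }

    // Hypothetical helper: sourceDir is accepted but intentionally unused,
    // mirroring "we ignore what the source said about the dir name".
    static String destinationFor(String tableRoot, String sourceDir,
                                 Map<String, String> partSpec) {
        return tableRoot + "/" + relativePath(partSpec);
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("year", "2012");
        spec.put("hour", "18");
        String root = "/apps/hive/warehouse/weblogs";
        String[] sources = {root + "/2012/18", root + "/201218", root + "/frank"};
        for (String src : sources) {
            // Every source layout maps to the same destination.
            System.out.println(src + " -> " + destinationFor(root, src, spec));
        }
    }
}
```

All three source layouts end up at the same year=2012/hour=18 location here, 
but only because this sketch uses a map with a fixed iteration order.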

==

The underlying problem here is this: currently, the list of key-values is 
stored as a HashMap, which is not ordered, and thus its iteration order is not 
guaranteed to be identical across JDKs or OSes. This doesn't currently affect 
us, however, since the order is only relevant at the time a partition is 
created, and as long as the metadata is consistent and points to the correct 
location, Hive doesn't care.

Since we don't force an order, the order is whatever the iteration order of 
that HashMap happens to be for those values, on that JDK version + OS. This 
means that as long as you don't change the JDK version, the OS, or the 
key-values, it is repeatably consistent. Change even one of those, however, and 
you could easily wind up with a different order. This can even happen within 
Hive itself: we've done "ALTER TABLE ADD PARTITION" for a while on a cluster, 
we upgrade the JDK, and then the next "ALTER TABLE ADD PARTITION" picks dd/mm 
instead of the mm/dd it has used until then. The same can happen if one machine 
was on Ubuntu and the other on CentOS, etc.
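
As a small illustration of why the container type matters (a sketch with 
hypothetical names, not Hive code): two maps can be equal as partition specs 
and still be walked in different orders. The HashMap documentation makes no 
guarantee about iteration order, whereas LinkedHashMap iterates in insertion 
order.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SpecOrderDemo {
    // Hypothetical helper: joins key=value segments in iteration order.
    static String relativePath(Map<String, String> partSpec) {
        return partSpec.entrySet().stream()
                .map(kv -> kv.getKey() + "=" + kv.getValue())
                .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        Map<String, String> ordered = new LinkedHashMap<>();
        ordered.put("month", "4");
        ordered.put("day", "23");
        ordered.put("year", "2016");

        // Same contents, but no ordering guarantee.
        Map<String, String> unordered = new HashMap<>(ordered);

        // The two specs are equal as maps...
        System.out.println(ordered.equals(unordered));
        // ...the LinkedHashMap path follows insertion order...
        System.out.println(relativePath(ordered));
        // ...while the HashMap path order is unspecified and may differ.
        System.out.println(relativePath(unordered));
    }
}
```

The directory name is derived from the iteration order, so any change in that 
order shows up directly in the path of newly created partitions.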

Some possible solutions:
a) We can force the order of key-values to follow the order of key occurrence 
in the metastore for all "new" partitions ever created in Hive. The problem 
with this is that it might force additional metastore calls to determine this 
order (adding load).
b) We can force alphabetical order of key-values for all "new" partitions ever 
created in Hive. The problem with this is that we now get into the notion of 
what alphabetical order means in which codepage (although that can still be 
deterministic). It's also possible that going alphabetical will cause a pretty 
"dumb" ordering, where "dumb" can mean (i) non-intuitive, say 
day=23/market_id=45/month=4/year=2016, or (ii) bad in terms of skew, with a 
higher-frequency partition key being the parent of a lower-frequency one, 
resulting in a much larger number of dirs created.
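
The two options above can be sketched in Java as follows. This is an 
illustration only: the names are hypothetical, the table's key order is passed 
in as a plain list rather than fetched from the metastore, and the spec is 
assumed to contain a value for every key. For option (b), a TreeMap with the 
natural String ordering compares UTF-16 code units, which is deterministic 
regardless of locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ForcedOrder {
    // Joins key=value segments following the given key order.
    // Assumes spec holds a value for every key in keys.
    static String join(List<String> keys, Map<String, String> spec) {
        List<String> segments = new ArrayList<>();
        for (String key : keys) {
            segments.add(key + "=" + spec.get(key));
        }
        return String.join("/", segments);
    }

    // Option (a): follow the order in which the table declares its keys.
    static String byTableKeyOrder(List<String> tableKeys, Map<String, String> spec) {
        return join(tableKeys, spec);
    }

    // Option (b): alphabetical order of the key names.
    static String alphabetical(Map<String, String> spec) {
        return join(new ArrayList<>(new TreeMap<>(spec).keySet()), spec);
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("year", "2016");
        spec.put("month", "4");
        spec.put("day", "23");
        spec.put("market_id", "45");

        List<String> tableKeys = Arrays.asList("year", "month", "day", "market_id");
        System.out.println(byTableKeyOrder(tableKeys, spec));
        // prints year=2016/month=4/day=23/market_id=45
        System.out.println(alphabetical(spec));
        // prints day=23/market_id=45/month=4/year=2016
    }
}
```

Both variants are stable no matter how the incoming map iterates; they differ 
only in where the ordering information comes from.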

Neither of these solves the original issue of export/import, because all we 
wind up doing here is forcing an order going forward, and not making sure to 
"retain" whatever existed. Also, if a JDK/OS combination resulted in a 
different default for two different users with a similar schema, then by 
"standardizing" it going forward, we break convention for one of them, either 
way.

Even in the cases where export/import has currently been flipping a mm/dd/yyyy 
into a dd/mm/yyyy, for example, if we standardize on retaining the original 
order, we make it weird for a bunch of users that have had a mm/dd/yyyy in 
place, and don't care about the order as long as it is consistent across the 
table (a goal I'd argue they shouldn't have or care about, but nevertheless one 
that might exist).

Other solutions that are possible:

a) Let a table specify that it cares about its default partition-naming scheme, 
similar to what hcat.dynamic.partitioning.custom.pattern does for HCat. The 
problem with this is that it can introduce complexity to a warehouse if people 
use this feature extensively, i.e. it actually does nothing for the data and 
perf in Hive; it's simply for usability with external tools, and we run into a 
too-many-configs-why-was-this-feature-even-here scenario. But maybe we can 
ignore that.

b) Change export/import to honour the existing order in the case of managed 
tables (but ignore order or customization for external tables, because we truly 
cannot determine what patterns might be used for external tables). This does 
not help existing export/import cases, and may impose a different norm on a 
bunch of users, but does help a little going ahead.
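
For option (a) in this second list, a per-table naming scheme could look 
roughly like the sketch below. The ${key} placeholder syntax and all names here 
are assumptions made for illustration, loosely modeled on the HCat property 
mentioned above; this is not an existing Hive table property.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomPatternSketch {
    // Assumed placeholder syntax: ${partition_key}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Hypothetical helper: replaces each ${key} with its value from the spec.
    static String expand(String pattern, Map<String, String> spec) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = spec.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException(
                        "No value for partition key: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("year", "2016");
        spec.put("month", "4");
        spec.put("day", "23");
        System.out.println(expand("${year}/${month}/${day}", spec));
        // prints 2016/4/23
    }
}
```

Because the pattern names each key explicitly, the resulting layout no longer 
depends on map iteration order, at the cost of one more per-table setting.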

Sorry for the longer-than-intended ramble, but this problem has been known 
about for a while and wasn't fixed because of these trade-offs, and I wanted to 
provide context.

> Import table change order of dynamic partitions
> -----------------------------------------------
>
>                 Key: HIVE-13652
>                 URL: https://issues.apache.org/jira/browse/HIVE-13652
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.2.1
>            Reporter: Lukas Waldmann
>
> Table with multiple dynamic partitions like year,month, day exported using 
> "export table" command is imported (using "import table") such a way that 
> order of partitions is changed to day, month, year.
> Export DB:  Hive 0.14
> Import DB:  Hive 1.2.1000.2.4.0.0-169
> Tables created as:
> create table T1
> ( ... ) PARTITIONED BY (period_year string, period_month string, period_day 
> string) STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");
> export command:
> export table t1 to 'path'
> import command:
> import table t1 from 'path'
> HDFS file structure on both original table location and export path keeps the 
> original partition order ../year/month/day
> HDFS file structure after import is .../day/month/year



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
