Suresh Subbiah created TRAFODION-2199:
-----------------------------------------
Summary: SELECT on Hive partitioned table causes timestamp
mismatch error
Key: TRAFODION-2199
URL: https://issues.apache.org/jira/browse/TRAFODION-2199
Project: Apache Trafodion
Issue Type: Bug
Components: sql-exe
Affects Versions: 2.1-incubating
Reporter: Suresh Subbiah
Assignee: Suresh Subbiah
Fix For: 2.1-incubating
During execute of a select query that includes Hive partitioned tables, this
error is occasinally seen.
SQL>execute s;
*** ERROR[8436] Mismatch detected between compiletime and runtime hive table
definitions. [2016-08-28 22:58:14]
*** WARNING[8597] Statement was automatically retried 1 time(s). Delay before
each retry was 0 seconds. See next entry for the error that caused this retry.
[2016-08-28 22:58:14]
*** WARNING[8436] Mismatch detected between compiletime and runtime hive table
definitions. [2016-08-28 22:58:14]
Workarounds :
cqd HIVE_DATA_MOD_CHECK 'OFF' ; will avoid the problem.
Alternatively the setting described here (found by Sandhya) will also avoid the
problem
https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/CDH-5-3-Hive-staging-directory-has-wrong-default-value/td-p/23585
http://stackoverflow.com/questions/35176190/hive-queries-generating-mismanaged-staging-directories
Hi Anoop, Sandhya
In this case both modTSforDir_ and the directory timestamp we get at runtime
are the same (1470914707).
This happened because during compile we do go through all files and get the max
timestamp of all files or directory as Anoop said. However this iteration
through all files is based on a list of hive_sd_structs. These sd_structs are
created by parsing a string we get from the java side. This string is similar
to DESCRIBE EXTENDED <table-name> output, with additional info for each
partition. Since this string does not contain information about non-partition
directories like hive-staging, they are excluded from the compile time
calculation. We need to do something similar at runtime. We cannot accept every
directory returned by hdfsListDirectory() for a partitioned table as valid.
As Eric and Sandhya said I will exclude directory names that include
“.hive-staging”. I could also insist that the directory name have a “=” sign in
it. This will be checked into Trafodion by tomorrow. I hope the cost of
checking so many names/timestamps is not prohibitive. In this case we have
about 2000 directory names to deal with. The directory names are rather long. I
suppose in the worst case this is about a million comparisons which could add
about a millisecond to execution time. Let me know if this is not a good idea.
Thanks
Suresh
At compile time, we get the max time of all files that are part of a table.
This timestamp is stored in modTSforDir_.
At runtime, we look at the directory timestamp. If this table got created/moved
into that directory and changed the timestamp, that will give a ts mismatch.
This mismatch will not be fixed even with a recompile as the recompile will
once again get the max of table files.
Can we tell if the runtime directory timestamp was set to indicate the max of
all files(hidden or actual) under that dir?
anoop
Looks like there are cases where this staging directory is left over after hive
uses it internally for it’s own processing. So as Suresh says, this is
something our code possibly needs to ignore . Seems a bit messy to ignore files
with this name but if we don’t then we could run into this issue…
Sandhya
It appears that hive itself creates and uses these staging directories and I
see some problems reported about this.
Need to look more. It’s not something we directly create ..
Sandhya
If this file has a higher timestamp, then should that have changed the timestamp
of the directory it is under? Or are there cases where it may not, and if there
are, then how do we get to that situation?
Is this file created during a hive insert or part of some hive query.
anoop
Specially given this is a hidden file, given it starts with .?
May be logic should just filter hidden files?
Eric
Hi,
Yes I agree with everyone that this is a bug. I debugged this and find that it
might be related to a problem Sandhya told me about a few days ago.
The directory for store_sales table has this timestamp
ptr_ = 0x7f3c578edc48
"hdfs://n01.trafodion.local:8020/user/hive/warehouse/tpcds.db/store_sales",
offset_ = 139897143745608}}
(gdb) p ((ComTdbOrcScan*)&tdb)->modTSforDir_
$12 = 1470914707
However this directory had 1820+ directories underneath it. One of them has a
timestamp that is slightly higher
Name = 0x74e2ce0
"hdfs://n01.trafodion.local:8020/user/hive/warehouse/tpcds.db/store_sales/.hive-staging_hive_2016-08-11_19-25-06_941_7211319284354636719-1",
mLastMod = 1470915220,
Typical directories under store_sales directory have this type of name, with a
literal for partition value. This staging directory seem to be something we
should ignore. I wonder if we should do that based on name or some other
criteria.
/user/hive/warehouse/tpcds.db/store_sales/ss_sold_date_sk=2450816
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)