Suresh Subbiah created TRAFODION-2199:
-----------------------------------------

             Summary: SELECT on Hive partitioned table causes timestamp 
mismatch error
                 Key: TRAFODION-2199
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2199
             Project: Apache Trafodion
          Issue Type: Bug
          Components: sql-exe
    Affects Versions: 2.1-incubating
            Reporter: Suresh Subbiah
            Assignee: Suresh Subbiah
             Fix For: 2.1-incubating


During execute of a select query that includes Hive partitioned tables, this 
error is occasinally seen.

SQL>execute s;
 
*** ERROR[8436] Mismatch detected between compiletime and runtime hive table 
definitions. [2016-08-28 22:58:14]
*** WARNING[8597] Statement was automatically retried 1 time(s). Delay before 
each retry was 0 seconds. See next entry for the error that caused this retry.  
[2016-08-28 22:58:14]
*** WARNING[8436] Mismatch detected between compiletime and runtime hive table 
definitions. [2016-08-28 22:58:14]
 

Workarounds :
cqd HIVE_DATA_MOD_CHECK 'OFF' ; will avoid the problem.

Alternatively the setting described here (found by Sandhya) will also avoid the 
problem
https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/CDH-5-3-Hive-staging-directory-has-wrong-default-value/td-p/23585
http://stackoverflow.com/questions/35176190/hive-queries-generating-mismanaged-staging-directories

Hi Anoop, Sandhya
 
In this case both modTSforDir_ and the directory timestamp we get at runtime 
are the same (1470914707).
This happened because during compile we do go through all files and get the max 
timestamp of all files or directory as Anoop said. However this iteration 
through all files is based on a list of hive_sd_structs. These sd_structs are 
created by parsing a string we get from the java side. This string is similar 
to DESCRIBE EXTENDED <table-name> output, with additional info for each 
partition. Since this string does not contain information about non-partition 
directories like hive-staging, they are excluded from the compile time 
calculation. We need to do something similar at runtime. We cannot accept every 
directory returned by hdfsListDirectory() for a partitioned table as valid.
 
As Eric  and Sandhya said I will exclude directory names that include 
“.hive-staging”. I could also insist that the directory name have a “=” sign in 
it. This will be checked into Trafodion by tomorrow. I hope the cost of 
checking so many names/timestamps is not prohibitive. In this case we have 
about 2000 directory names to deal with. The directory names are rather long. I 
suppose in the worst case this is about a million comparisons which could add 
about a millisecond to execution time. Let me know if this is not a good idea.
 
Thanks
Suresh
 
 
 
 
At compile time, we get the max time of all files that are part of a table.
This timestamp is stored in modTSforDir_.
 
At runtime, we look at the directory timestamp. If this table got created/moved
into that directory and changed the timestamp, that will give a ts mismatch.
This mismatch will not be fixed even with a recompile as the recompile will
once again get the max of table files.
 
Can we tell if the runtime directory timestamp was set to indicate the max of
all files(hidden or actual) under that dir?
anoop
 

 
Looks like there are cases where this staging directory is left over after hive 
uses it internally for it’s own processing. So as Suresh says, this is 
something our code possibly needs to ignore . Seems a bit messy to ignore files 
with this name but if we don’t then we could run into this issue…
Sandhya
 
 
It appears that hive itself creates and uses these staging directories and I 
see some problems reported about this. 
Need to look more. It’s not something we directly create ..
 
Sandhya
 

If this file has a higher timestamp, then should that have changed the timestamp
of the directory it is under? Or are there cases where it may not, and if there
are, then how do we get to that situation?
 
Is this file created during a hive insert or part of some hive query.
 
anoop
 
 
Specially given this is a hidden file, given it starts with .?
May be logic should just filter hidden files?
Eric
 

Hi,
 
Yes I agree with everyone that this is a bug. I debugged this and find that it 
might be related to a problem Sandhya told me about a few days ago.
 
The directory for store_sales table has this timestamp
ptr_ = 0x7f3c578edc48 
"hdfs://n01.trafodion.local:8020/user/hive/warehouse/tpcds.db/store_sales", 
offset_ = 139897143745608}}
(gdb) p ((ComTdbOrcScan*)&tdb)->modTSforDir_
$12 = 1470914707
 
However this directory had 1820+ directories underneath it. One of them has a 
timestamp that is slightly higher
Name = 0x74e2ce0 
"hdfs://n01.trafodion.local:8020/user/hive/warehouse/tpcds.db/store_sales/.hive-staging_hive_2016-08-11_19-25-06_941_7211319284354636719-1",
 
mLastMod = 1470915220,
 

Typical directories under store_sales directory have this type of name, with a 
literal for partition value. This staging directory seem to be something we 
should ignore. I wonder if we should do that based on name or some other 
criteria.
/user/hive/warehouse/tpcds.db/store_sales/ss_sold_date_sk=2450816




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to