Mithun Radhakrishnan created HIVE-15575:
-------------------------------------------

             Summary: ALTER TABLE CONCATENATE and hive.merge.tezfiles seems 
busted for UNION ALL output
                 Key: HIVE-15575
                 URL: https://issues.apache.org/jira/browse/HIVE-15575
             Project: Hive
          Issue Type: Bug
            Reporter: Mithun Radhakrishnan
            Priority: Critical


Hive {{UNION ALL}} writes its output into sub-directories under the table/partition directory. For example:

{noformat}
hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, goo 
string ) stored as textfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, bar 
string, goo string ) partitioned by ( dt string ) stored as orcfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite 
table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from 
source UNION ALL select 'go', 'far', 'moo', '1' from source;
...
Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
(dt=null)
         Time taken for load dynamic partitions : 311
        Loading partition {dt=1}
         Time taken for adding to write entity : 3
OK
Time taken: 27.659 seconds
hive (mythdb_hadooppf_17544)> dfs -ls -R 
/tmp/mythdb_hadooppf_17544/results_partitioned;
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
-rwxrwxrwt   3 dfsload hdfs        349 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/000000_0
drwxrwxrwt   - dfsload hdfs          0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
-rwxrwxrwt   3 dfsload hdfs        368 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/000000_0
{noformat}

These results can only be read when {{mapred.input.dir.recursive=true}}, which {{TezCompiler::init()}} appears to set. The Hadoop default, however, is {{false}}.
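For illustration, a reader that keeps the Hadoop default would have to opt into recursive listing before the nested files become visible. A minimal sketch; {{hive.mapred.supports.subdirectories}} is my assumption about what an MR-based read would also need:
{noformat}
-- sketch: settings an MR-based reader would need to see the files
-- under dt=1/1 and dt=1/2 (assumption, not a verified fix)
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
select count(*) from results_partitioned where dt = '1';
{noformat}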
This leads to the following errors:
1. Running {{CONCATENATE}} on the partition causes data loss:
{noformat}
hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; 
alter table results_partitioned partition ( dt='1' ) concatenate ; set 
mapred.input.dir.recursive; "
...
OK
Time taken: 2.151 seconds
mapred.input.dir.recursive=false


Status: Running (Executing on YARN cluster with App id 
application_1481756273279_5088754)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge         SUCCEEDED      0          0        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [>>--------------------------] 0%    ELAPSED TIME: 0.35 s
--------------------------------------------------------------------------------
Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=1)
Moved: 
'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1'
 to trash at: 
hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
Moved: 
'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2'
 to trash at: 
hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
OK
Time taken: 25.873 seconds

$ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
           1            0                  0 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
{noformat}
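Note that the sub-directories were moved to trash rather than deleted, so the data should be recoverable. A hedged sketch, assuming the default HDFS trash layout (which mirrors the original path under {{.Trash/Current}}):
{noformat}
# recovery sketch; paths taken from the 'Moved to trash' lines above
$ hdfs dfs -mv \
    /user/dfsload/.Trash/Current/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/* \
    /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/
{noformat}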

2. {{hive.merge.tezfiles}} is broken, because the merge task attempts to merge files across {{results_partitioned/dt=1/1}} and {{results_partitioned/dt=1/2}}:
{noformat}
$ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; 
insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 
'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source; "
...
Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
Total jobs = 3
Launching Job 1 out of 3


Status: Running (Executing on YARN cluster with App id 
application_1481756273279_5089989)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 3 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 13.07 s
--------------------------------------------------------------------------------
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3


Status: Running (Executing on YARN cluster with App id 
application_1481756273279_5089989)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
File Merge           RUNNING      1          0        1        0       2       0
--------------------------------------------------------------------------------
VERTICES: 00/01  [>>--------------------------] 0%    ELAPSED TIME: 3.06 s
--------------------------------------------------------------------------------
...
{noformat}

The {{File Merge}} vertex fails with the following exception:

{noformat}
TaskAttempt 3 failed, info=[Error: Failure while running 
task:java.lang.RuntimeException: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple 
partitions for one merge mapper: 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
 NOT EQUAL TO 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
        at 
org.apache.hadoop.hive.ql.exec.tez.MergeFileTezProcessor.run(MergeFileTezProcessor.java:42)
        at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:192)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:184)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:184)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:180)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Multiple 
partitions for one merge mapper: 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
 NOT EQUAL TO 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
        at 
org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:217)
        at 
org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.run(MergeFileRecordProcessor.java:151)
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
        ... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Multiple partitions for one merge mapper: 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
 NOT EQUAL TO 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
        at 
org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:159)
        at 
org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.process(OrcFileMergeOperator.java:62)
        at 
org.apache.hadoop.hive.ql.exec.tez.MergeFileRecordProcessor.processRow(MergeFileRecordProcessor.java:208)
        ... 16 more
Caused by: java.io.IOException: Multiple partitions for one merge mapper: 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/2
 NOT EQUAL TO 
hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/.hive-staging_hive_2017-01-10_23-35-58_881_4062579557908207136-1/-ext-10002/dt=1/1
        at 
org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.checkPartitionsMatch(AbstractFileMergeOperator.java:174)
        at 
org.apache.hadoop.hive.ql.exec.AbstractFileMergeOperator.fixTmpPath(AbstractFileMergeOperator.java:191)
        at 
org.apache.hadoop.hive.ql.exec.OrcFileMergeOperator.processKeyValuePairs(OrcFileMergeOperator.java:86)
        ... 18 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 
killedTasks:0, Vertex vertex_1481756273279_5089989_2_00 [File Merge] 
killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to 
VERTEX_FAILURE. failedVertices:1 killedVertices:0
{noformat}
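The failure appears to originate in {{AbstractFileMergeOperator.checkPartitionsMatch()}} (visible in the trace above), which seems to require that every file handled by one merge mapper share the same parent directory. The union sub-directories break that assumption; schematically (file names assumed from the earlier listing, paths abbreviated):
{noformat}
# one merge mapper receives inputs whose parents differ:
.../-ext-10002/dt=1/1/000000_0   -> parent .../dt=1/1
.../-ext-10002/dt=1/2/000000_0   -> parent .../dt=1/2
{noformat}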

3. Data produced by Hive {{UNION ALL}} is not readable from Pig/HCatalog without {{mapred.input.dir.recursive}} (see the sketch below).
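For instance, a hypothetical Pig-through-HCatalog read (the loader class is from HCatalog's Pig integration; the exact symptom is my assumption) would surface no rows until the flag is set:
{noformat}
grunt> set mapred.input.dir.recursive true; -- without this, the nested files are skipped
grunt> r = LOAD 'mythdb_hadooppf_17544.results_partitioned'
           USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> dump r;
{noformat}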

Setting {{mapred.input.dir.recursive=true}} in {{hive-site.xml}} (as sketched below) should resolve the first and third problems. But is that the recommendation? It is intrusive, and it does not solve #2. As far as my limited understanding goes, Pig's {{UNION}} does not behave this way.
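If the site-wide workaround were adopted, it would look like the following; a sketch of the property discussed above, not a recommendation:
{noformat}
<!-- hive-site.xml: the intrusive site-wide workaround -->
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
{noformat}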



