[ https://issues.apache.org/jira/browse/HIVE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Lerman reopened HIVE-1001:
-------------------------------

This patch appears to affect other functionality. Without the patch, "insert overwrite directory 'out' select ... from ..." yields 1 MR job, but with the patch, it yields 2 MR jobs.

At first glance, it looks like the second job in the insert overwrite plan is a conditional operation that either does an HDFS move to relocate the data from the temp output directory to its final destination, or runs an MR job to copy the data (maybe if it needs to copy across clusters?). Before the patch, it just does the HDFS move, but after it, it does the MR copy. Maybe the paths it's comparing to determine which task to run are getting screwed up by the patch? (See the sketch after the reproduction steps below.)

Steps to reproduce, using Cloudera's Hadoop 0.20.1+152 (since the CombineFile functionality doesn't work in 0.20.1 without the extra patches):

* Create a data file twolines.dat containing:

    key1^Avalue1
    key2^Avalue2

* Create a table with two partitions, each containing that data:

    CREATE TABLE fourlinestest(KEY STRING, VALUE STRING)
    PARTITIONED BY (part int) STORED AS TEXTFILE;
    load data local inpath 'twolines.dat' into table fourlinestest partition (part=1);
    load data local inpath 'twolines.dat' into table fourlinestest partition (part=2);

* Using Hive r888452 (before this patch was applied):

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    insert overwrite directory 'out' select key from fourlinestest;

--> The log shows that the pools are getting created with corrupt paths (which is the bug this patch was meant to fix), but the query runs successfully with 1 MR job:

    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200912240057_7597, Tracking URL = http://REMOVED:50030/jobdetails.jsp?jobid=job_200912240057_7597
    Kill Command = REMOVED/bin/hadoop job -Dmapred.job.tracker=REMOVED:8021 -kill job_200912240057_7597
    2009-12-28 14:47:02,350 Stage-1 map = 0%, reduce = 0%
    2009-12-28 14:47:14,189 Stage-1 map = 100%, reduce = 0%
    2009-12-28 14:47:18,404 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_200912240057_7597
    Launching Job 2 out of 2
    Moving data to: hdfs://REMOVED/561636114/10000
    Moving data to: out
    4 Rows loaded to out
    OK

* Apply the patch, rebuild, and rerun:

    patch -p0 < hive.1001.1.patch
    ant package
    bin/hive

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    insert overwrite directory 'out' select key from fourlinestest;

--> This time the second job actually runs as an MR job (which then fails because of HIVE-1006):

    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200912240057_7616, Tracking URL = http://REMOVED:50030/jobdetails.jsp?jobid=job_200912240057_7616
    Kill Command = REMOVED/bin/hadoop job -Dmapred.job.tracker=REMOVED:8021 -kill job_200912240057_7616
    2009-12-28 14:54:39,414 Stage-1 map = 0%, reduce = 0%
    2009-12-28 14:54:51,224 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_200912240057_7616
    Launching Job 2 out of 2
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    java.io.IOException: cannot find dir = hdfs://REMOVED/1656362434/10001/attempt_200912240057_7616_m_000000_0 in partToPartitionInfo!
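To illustrate the speculation above, here is a minimal sketch (in Java, against the Hadoop FileSystem API) of the kind of source/destination comparison such a conditional step could make before choosing between a cheap rename and an MR copy. The class name, paths, and decision logic are hypothetical, not Hive's actual conditional-task code; the point is only that a corrupted source URI would make the two sides compare unequal and push the plan into the expensive copy:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch -- not Hive's actual conditional-task resolver.
    public class MoveOrCopySketch {

      // True when source and destination live on the same filesystem, so a
      // plain rename suffices; false forces a copy (potentially an MR job).
      static boolean canUseSimpleMove(Path src, Path dst, Configuration conf)
          throws IOException {
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        return srcFs.getUri().equals(dstFs.getUri());
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Made-up paths. If the scratch-dir path came back with a mangled
        // scheme or authority (as with the corrupt pool paths above), the
        // URIs would no longer compare equal and the planner would fall
        // back to the MR copy even for a same-cluster move.
        Path src = new Path("hdfs://namenode/tmp/561636114/10000");
        Path dst = new Path("hdfs://namenode/user/hive/out");
        System.out.println(canUseSimpleMove(src, dst, conf)
            ? "HDFS move" : "MR copy job");
      }
    }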
> CombinedHiveInputFormat should parse the inputpath correctly
> ------------------------------------------------------------
>
>                 Key: HIVE-1001
>                 URL: https://issues.apache.org/jira/browse/HIVE-1001
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>             Fix For: 0.5.0
>
>         Attachments: hive.1001.1.patch
>
>
> From David Lerman:
> "
> I'm running into errors where CombinedHiveInputFormat is combining data from
> two different tables, which is causing problems because the tables have
> different input formats.
> It looks like the problem is in
> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It calls
> CombineFileInputFormat.getInputPaths, which returns the list of input paths,
> and then chops off the first 5 characters to remove "file:" from the
> beginning. But the return value I'm getting from getInputPaths is actually
> hdfs://domain/path, so when it creates the pools using these paths, none of
> the input paths match the pools (since the pool paths are just the chopped
> strings, without the protocol or domain).
> "
> We should use Path.getPath() to get the path part of a URI instead of just
> chopping off 5 chars.
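For illustration, here is a minimal sketch of the parsing problem described in the quote and of the suggested fix. This is not the actual patch; the table path is made up, and since org.apache.hadoop.fs.Path has no getPath() method, the sketch assumes the quote's "Path.getPath()" refers to taking the path component via Path.toUri().getPath():

    import org.apache.hadoop.fs.Path;

    public class InputPathParsingSketch {
      public static void main(String[] args) {
        String raw = "hdfs://domain/user/hive/warehouse/tbl";

        // Buggy approach: assumes every path starts with "file:". For an
        // hdfs:// URI this strips "hdfs:" and leaves
        // "//domain/user/hive/warehouse/tbl", which never matches a pool.
        String chopped = raw.substring(5);

        // Suggested approach: take the path component of the URI instead
        // of chopping a fixed-length prefix.
        String pathPart = new Path(raw).toUri().getPath();

        System.out.println(chopped);   // //domain/user/hive/warehouse/tbl
        System.out.println(pathPart);  // /user/hive/warehouse/tbl
      }
    }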