[ https://issues.apache.org/jira/browse/HIVE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Lerman reopened HIVE-1001:
-------------------------------

This patch appears to affect other functionality. Without the patch, "insert overwrite directory 'out' select ... from ..." yields 1 MR job, but with the patch, it yields 2 MR jobs.

At first glance, it looks like the second job in the insert overwrite plan is a conditional operation that either does an HDFS move to relocate the data from the temp output directory to its final destination, or runs an MR job to copy the data (maybe if it needs to copy across clusters?). Before the patch, it just does the HDFS move, but after it, it does the MR copy. Maybe the paths it's comparing to determine which task to run are getting screwed up by the patch? (See the sketch after the reproduction steps below.)

Steps to reproduce, using Cloudera's Hadoop 0.20.1+152 (since the CombineFile functionality doesn't work in 0.20.1 without the extra patches):

* Create a data file twolines.dat containing:

    key1^Avalue1
    key2^Avalue2

* Create a table with two partitions, each containing that data:

    CREATE TABLE fourlinestest(KEY STRING, VALUE STRING)
    PARTITIONED BY (part int) STORED AS TEXTFILE;
    load data local inpath 'twolines.dat' into table fourlinestest partition (part=1);
    load data local inpath 'twolines.dat' into table fourlinestest partition (part=2);

* Using Hive r888452 (before this patch was applied):

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    insert overwrite directory 'out' select key from fourlinestest;

--> The log shows that the pools are getting created with corrupt paths (which is the bug this patch was meant to fix), but the query runs successfully with 1 MR job:

    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200912240057_7597, Tracking URL = http://REMOVED:50030/jobdetails.jsp?jobid=job_200912240057_7597
    Kill Command = REMOVED/bin/hadoop job -Dmapred.job.tracker=REMOVED:8021 -kill job_200912240057_7597
    2009-12-28 14:47:02,350 Stage-1 map = 0%, reduce = 0%
    2009-12-28 14:47:14,189 Stage-1 map = 100%, reduce = 0%
    2009-12-28 14:47:18,404 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_200912240057_7597
    Launching Job 2 out of 2
    Moving data to: hdfs://REMOVED/561636114/10000
    Moving data to: out
    4 Rows loaded to out
    OK

* Apply the patch, rebuild, and rerun:

    patch -p0 < hive.1001.1.patch
    ant package
    bin/hive

    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    insert overwrite directory 'out' select key from fourlinestest;

--> This time the second job actually runs as an MR job (which then fails because of HIVE-1006):

    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_200912240057_7616, Tracking URL = http://REMOVED:50030/jobdetails.jsp?jobid=job_200912240057_7616
    Kill Command = REMOVED/bin/hadoop job -Dmapred.job.tracker=REMOVED:8021 -kill job_200912240057_7616
    2009-12-28 14:54:39,414 Stage-1 map = 0%, reduce = 0%
    2009-12-28 14:54:51,224 Stage-1 map = 100%, reduce = 0%
    Ended Job = job_200912240057_7616
    Launching Job 2 out of 2
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    java.io.IOException: cannot find dir = hdfs://REMOVED/1656362434/10001/attempt_200912240057_7616_m_000000_0 in partToPartitionInfo!
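To illustrate the speculation above, here is a minimal sketch (in Java, against the Hadoop FileSystem API) of the kind of source/destination comparison such a conditional step could make before choosing between a cheap rename and an MR copy. The class name, paths, and decision logic are hypothetical, not Hive's actual conditional-task code; the point is only that a corrupted source URI would make the two sides compare unequal and push the plan into the expensive copy:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch -- not Hive's actual conditional-task resolver.
    public class MoveOrCopySketch {

      // True when source and destination live on the same filesystem, so a
      // plain rename suffices; false forces a copy (potentially an MR job).
      static boolean canUseSimpleMove(Path src, Path dst, Configuration conf)
          throws IOException {
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        return srcFs.getUri().equals(dstFs.getUri());
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Made-up paths. If the scratch-dir path came back with a mangled
        // scheme or authority (as with the corrupt pool paths above), the
        // URIs would no longer compare equal and the planner would fall
        // back to the MR copy even for a same-cluster move.
        Path src = new Path("hdfs://namenode/tmp/561636114/10000");
        Path dst = new Path("hdfs://namenode/user/hive/out");
        System.out.println(canUseSimpleMove(src, dst, conf)
            ? "HDFS move" : "MR copy job");
      }
    }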
> CombinedHiveInputFormat should parse the inputpath correctly
> ------------------------------------------------------------
>
>                 Key: HIVE-1001
>                 URL: https://issues.apache.org/jira/browse/HIVE-1001
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Zheng Shao
>            Assignee: Namit Jain
>             Fix For: 0.5.0
>
>         Attachments: hive.1001.1.patch
>
>
> From David Lerman:
> "
> I'm running into errors where CombinedHiveInputFormat is combining data from
> two different tables, which is causing problems because the tables have
> different input formats.
> It looks like the problem is in
> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It calls
> CombineFileInputFormat.getInputPaths, which returns the list of input paths,
> and then chops off the first 5 characters to remove "file:" from the
> beginning. But the return value I'm getting from getInputPaths is actually
> hdfs://domain/path, so when it creates the pools using these paths, none of
> the input paths match the pools (since the pool paths are just the chopped
> strings, without the protocol or domain).
> "
> We should use Path.getPath() to get the path part of a URI instead of just
> chopping off 5 chars.
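For illustration, here is a minimal sketch of the parsing problem described in the quote and of the suggested fix. This is not the actual patch; the table path is made up, and since org.apache.hadoop.fs.Path has no getPath() method, the sketch assumes the quote's "Path.getPath()" refers to taking the path component via Path.toUri().getPath():

    import org.apache.hadoop.fs.Path;

    public class InputPathParsingSketch {
      public static void main(String[] args) {
        String raw = "hdfs://domain/user/hive/warehouse/tbl";

        // Buggy approach: assumes every path starts with "file:". For an
        // hdfs:// URI this strips "hdfs:" and leaves
        // "//domain/user/hive/warehouse/tbl", which never matches a pool.
        String chopped = raw.substring(5);

        // Suggested approach: take the path component of the URI instead
        // of chopping a fixed-length prefix.
        String pathPart = new Path(raw).toUri().getPath();

        System.out.println(chopped);   // //domain/user/hive/warehouse/tbl
        System.out.println(pathPart);  // /user/hive/warehouse/tbl
      }
    }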