[ https://issues.apache.org/jira/browse/HCATALOG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541482#comment-13541482 ]
Daniel Dai commented on HCATALOG-580:
-------------------------------------

HCATALOG-580-1.patch fixes Pig_Checkin_7 but does not fix Pig_Checkin_1. HCATALOG-580-2.patch fixes both Pig_Checkin_1 and Pig_Checkin_7 (and all tests pass).

HCATALOG-580 and HCATALOG-584 each fix all the tests. HCATALOG-580 fixes them by correcting the logic that 538 introduced; HCATALOG-584 fixes them by removing the immediate cause of the failures. We need to commit both: 580 as the fix, 584 as a safeguard.

Here are more details about the issue 538 introduced and how 580 and 584 fix it:
1. 538 optimizes NameNode usage by moving the partition directory instead of the leaf files.
2. 538 finds the partition directory by assuming that the first child of that directory is a file, which is wrong (the first child can be _temporary or _logs).
3. 538 moves the partition directory out assuming it is two levels deep (_TEMP/partition). That is wrong for _temporary and _logs, and it raises a "directory does not exist" exception for them.
4. 584 solves the issue by creating the directory before moving the partition out (fs.rename); 580 solves it by fixing the 538 logic.
5. 584 makes sure the rename succeeds, but loses some of the optimization 538 introduced (a folder containing _temporary will not be treated as a partition folder, so the partition will not be moved as a whole).

> Optimizations in HCAT-538 break e2e tests
> -----------------------------------------
>
>                 Key: HCATALOG-580
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-580
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.5
>         Environment: RH 5.8 (on AWS)
> Hadoop 1.1.2.17 (build)
> HCat 0.5 (build)
>            Reporter: Sushanth Sowmyan
>            Assignee: Daniel Dai
>            Priority: Blocker
>             Fix For: 0.5
>
>         Attachments: HCATALOG-580-1.patch, HCATALOG-580-2.patch
>
>
> The optimizations brought in by HCATALOG-538 break dynamic partitioning in
> the e2e tests.
> The issue is that the assumption that if the first child in a
> directory structure is a directory, the rest are directories, and if the
> first child is a file, then the rest are files, is an incorrect one.
> (Admittedly, one part of that, the assumption that a directory whose first
> child is a file is a leaf directory, is not necessarily a bad one in
> premise, although still incorrect.)
>
> The issue with this is that the underlying FileOutputCommitter and OutputFormat
> behaviour affects whether you get files or directories, and
> whether any _temporary directories are still left behind, for example.
>
> In the case I tested, the issue is that there is a _temporary directory in a
> "leaf" directory, followed by part files. The optimization sees the
> _temporary directory, finds nothing inside it, so doesn't mkdir any parent,
> then decides that the rest are directories, then moves to the part file, and
> tries to rename it directly without mkdir-ing its parent directory.
>
> The e2e test conf in question is Pig_Checkin_7
> {code}
> {
> 'num' => 7
> ,'hcat_prep'=>q\drop table if exists pig_checkin_7;
> create table pig_checkin_7 (name string, age int) partitioned by (ds string) STORED AS TEXTFILE;\
> ,'pig' => q\a = load 'studentparttab30k' using org.apache.hcatalog.pig.HCatLoader();
> b = foreach a generate name, age, ds;
> store b into 'pig_checkin_7' using org.apache.hcatalog.pig.HCatStorer();\,
> ,'result_table' => 'pig_checkin_7',
> ,'sql' => "select name, age, ds from studentparttab30k;",
> ,'floatpostprocess' => 1
> ,'delimiter' => ' '
> }
> {code}
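
To make the failure mode described in this issue concrete, here is a minimal sketch in Python against the local filesystem. It is not the HCatalog code (which is Java and goes through the Hadoop FileSystem API), and it is not necessarily how either patch works. The function names, the `SCRATCH_NAMES` skip list, and the `ds=1` partition directory are all hypothetical, chosen only to show why classifying a directory by its first child goes wrong when a _temporary directory is listed ahead of the part files (which sorted order produces here).

```python
import os
import tempfile

# Hypothetical skip list: entries the issue says can sit next to part files.
SCRATCH_NAMES = {"_temporary", "_logs"}


def is_leaf_first_child(path):
    """Flawed heuristic: classify the directory by looking at its first child only."""
    children = sorted(os.listdir(path))
    if not children:
        return False
    return os.path.isfile(os.path.join(path, children[0]))


def is_leaf_all_children(path):
    """Safer check: ignore scratch entries, then require every remaining child to be a file."""
    children = [c for c in sorted(os.listdir(path)) if c not in SCRATCH_NAMES]
    return bool(children) and all(
        os.path.isfile(os.path.join(path, c)) for c in children
    )


def build_example(root):
    """Create root/ds=1 holding an empty _temporary directory and one part file."""
    partition = os.path.join(root, "ds=1")  # hypothetical partition directory
    os.makedirs(os.path.join(partition, "_temporary"))
    with open(os.path.join(partition, "part-00000"), "w") as f:
        f.write("example row\n")
    return partition


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        partition = build_example(root)
        # "_temporary" sorts before "part-00000", so the first child is a directory.
        print("first-child heuristic says leaf:", is_leaf_first_child(partition))
        print("all-children check says leaf:  ", is_leaf_all_children(partition))
```

In this layout the first-child heuristic does not recognise the partition directory as a leaf, while the check that skips scratch entries and inspects every child does. Whether skipping _temporary and _logs is the right policy for the real commit path is a separate question that this sketch does not settle.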