[GitHub] [hadoop] attilapiros commented on pull request #2971: MAPREDUCE-7341. Intermediate Manifest Committer

GitBox Sun, 13 Mar 2022 09:01:19 -0700


attilapiros commented on pull request #2971:
URL: https://github.com/apache/hadoop/pull/2971#issuecomment-1066131080



   I had an idea for testing this committer (and some of the others too). 
   Actually I started to put it together but now I am a bit uncertain how 
valuable this will be so I decided to share it with you in the current state 
before I waste too much time.
   
   The basic idea is simple: as everything is based on the output directory 
content we can check the unified diff after each of the operations. One 
possibility is to represent the whole directory content as string (here the 
`before` and `after` just arbitrary labels).
   
   My asserts become:
   ```java
       final Path textOutputPath = writeTextOutput(tContext);
       dirState = assertDirStateDiff(dirState, Arrays.asList(
           "--- before",
           "+++ after",
           "@@ -1,0 +1,9 @@",
           
"+file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000",
           "+~~~",
           "+key1\tval1",
           "+val1",
           "+val2",
           "+key2",
           "+key1",
           "+key2\tval2",
           "+~~~"));
   ```
   
   As you see I replaced the changing IDs with an _ID_  literal and from the 
absolute paths I have removed the output directory part.
   
   And here comes what I don't like (the huge json files):
   ```java
     commitTask(committer, tContext);
     dirState = assertDirStateDiff(dirState, Arrays.asList(
       "--- before",
       "+++ after",
       "@@ -9,1 +9,24 @@",
       " ~~~",
       
"+file:///_temporary/job_ID_0001/01/manifests/task_ID_0001_m_000000-manifest.json",
       "+~~~",
       "+{",
       "+  \"extraData\" : { },",
       "+  \"type\" : 
\"org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.TaskManifest/1\",",
       "+  \"version\" : 1,",
       "+  \"jobId\" : \"job_ID_0001\",",
       "+  \"jobAttemptNumber\" : 1,",
       "+  \"taskID\" : \"task_ID_0001_m_000000\",",
       "+  \"taskAttemptID\" : \"attempt_ID_0001_m_000000_0\",",
       "+  \"taskAttemptDir\" : 
\"file:/_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0\",",
       "+  \"files\" : [ {",
       "+    \"s\" : 
\"file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000\",",
       "+    \"d\" : \"file:/part-m-00000\",",
       "+    \"z\" : 40",
       "+  } ],",
       "+  \"directories\" : [ ],",
       "+  \"iostatistics\" : {",
       "+    \"counters\" : {",
       "+      \"committer_commit_job\" : 0,",
       "+      \"op_msync\" : 0,",
       "+      \"op_msync.failures\" : 0,",
       "+~~~"));
   ``` 
   Here `assertDirStateDiff` always cut the files longer then 20 lines. But we 
could use a an external resource too for the longer files.
   
   The code is not that complex:
   
   ```java
     private static List<String> unifiedDiff(List<String> before, List<String> 
after) {
       List<String> res =
         UnifiedDiffUtils.generateUnifiedDiff("before", "after", before, 
DiffUtils.diff(before, after), 1);
       // LOG.info("{}", "\"" + String.join("\", \"", res) + "\"");
       return res;
     }
   
     private static String replaceChangingParts(String replace, String input) {
       return input.replaceFirst("job_[0-9]*_", "job_ID_")
         .replaceFirst("attempt_[0-9]*_", "attempt_ID_")
         .replaceFirst("task_[0-9]*_", "task_ID_")
         .replace(replace, "");
     }
   
     private void getOutputDirContent(List<String> lines, Path path) throws 
IOException {
       FileStatus[] statuses = getFileSystem().listStatus(path);
       for (FileStatus status : statuses) {
         if (status.isDirectory()) {
           getOutputDirContent(lines, status.getPath());
         } else {
           
lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(), 
status.getPath().toUri().toString()));
           lines.add("~~~");
           try(FSDataInputStream in = getFileSystem().open(status.getPath());
               InputStreamReader is = new InputStreamReader(in);
               BufferedReader br = new BufferedReader(is)) {
             br.lines().limit(20).forEach((l) -> 
lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(), 
l)));
           }
           lines.add("~~~");
         }
       }
     }
   
     private List<String> assertDirStateDiff(List<String> previousDirState, 
List<String> diffList) throws IOException {
       List<String> currentDirState = getOutputDirContent();
       Assertions.assertThat(unifiedDiff(previousDirState, currentDirState))
         .isEqualTo(diffList);
       return currentDirState;
     }
   
     private List<String> getOutputDirContent() throws IOException {
       List<String> lines = new ArrayList<>();
       getOutputDirContent(lines, getOutputDir());
       return lines;
     }
     ```
   
   For the unified diff I am using a new dependency (it has Apache-2.0 License):
   ```
       <dependency>
         <groupId>io.github.java-diff-utils</groupId>
         <artifactId>java-diff-utils</artifactId>
         <version>4.9</version>
       </dependency>
   ```
   
   I think one of the advantage is `getOutputDirContent` is always a full scan 
and if something need to be removed it will be in contained in the unified diff 
and never be left out.
   Do you see any value in this idea? Is not this too much restrictions on this 
code? 
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hadoop] attilapiros commented on pull request #2971: MAPREDUCE-7341. Intermediate Manifest Committer

Reply via email to