attilapiros commented on pull request #2971:
URL: https://github.com/apache/hadoop/pull/2971#issuecomment-1066131080
I had an idea for testing this committer (and some of the others too).
Actually I started to put it together but now I am a bit uncertain how
valuable this will be so I decided to share it with you in the current state
before I waste too much time.
The basic idea is simple: as everything is based on the output directory
content we can check the unified diff after each of the operations. One
possibility is to represent the whole directory content as string (here the
`before` and `after` just arbitrary labels).
My asserts become:
```java
final Path textOutputPath = writeTextOutput(tContext);
dirState = assertDirStateDiff(dirState, Arrays.asList(
"--- before",
"+++ after",
"@@ -1,0 +1,9 @@",
"+file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000",
"+~~~",
"+key1\tval1",
"+val1",
"+val2",
"+key2",
"+key1",
"+key2\tval2",
"+~~~"));
```
As you see I replaced the changing IDs with an _ID_ literal and from the
absolute paths I have removed the output directory part.
And here comes what I don't like (the huge json files):
```java
commitTask(committer, tContext);
dirState = assertDirStateDiff(dirState, Arrays.asList(
"--- before",
"+++ after",
"@@ -9,1 +9,24 @@",
" ~~~",
"+file:///_temporary/job_ID_0001/01/manifests/task_ID_0001_m_000000-manifest.json",
"+~~~",
"+{",
"+ \"extraData\" : { },",
"+ \"type\" :
\"org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.TaskManifest/1\",",
"+ \"version\" : 1,",
"+ \"jobId\" : \"job_ID_0001\",",
"+ \"jobAttemptNumber\" : 1,",
"+ \"taskID\" : \"task_ID_0001_m_000000\",",
"+ \"taskAttemptID\" : \"attempt_ID_0001_m_000000_0\",",
"+ \"taskAttemptDir\" :
\"file:/_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0\",",
"+ \"files\" : [ {",
"+ \"s\" :
\"file:///_temporary/job_ID_0001/01/tasks/attempt_ID_0001_m_000000_0/part-m-00000\",",
"+ \"d\" : \"file:/part-m-00000\",",
"+ \"z\" : 40",
"+ } ],",
"+ \"directories\" : [ ],",
"+ \"iostatistics\" : {",
"+ \"counters\" : {",
"+ \"committer_commit_job\" : 0,",
"+ \"op_msync\" : 0,",
"+ \"op_msync.failures\" : 0,",
"+~~~"));
```
Here `assertDirStateDiff` always cut the files longer then 20 lines. But we
could use a an external resource too for the longer files.
The code is not that complex:
```java
private static List<String> unifiedDiff(List<String> before, List<String>
after) {
List<String> res =
UnifiedDiffUtils.generateUnifiedDiff("before", "after", before,
DiffUtils.diff(before, after), 1);
// LOG.info("{}", "\"" + String.join("\", \"", res) + "\"");
return res;
}
private static String replaceChangingParts(String replace, String input) {
return input.replaceFirst("job_[0-9]*_", "job_ID_")
.replaceFirst("attempt_[0-9]*_", "attempt_ID_")
.replaceFirst("task_[0-9]*_", "task_ID_")
.replace(replace, "");
}
private void getOutputDirContent(List<String> lines, Path path) throws
IOException {
FileStatus[] statuses = getFileSystem().listStatus(path);
for (FileStatus status : statuses) {
if (status.isDirectory()) {
getOutputDirContent(lines, status.getPath());
} else {
lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(),
status.getPath().toUri().toString()));
lines.add("~~~");
try(FSDataInputStream in = getFileSystem().open(status.getPath());
InputStreamReader is = new InputStreamReader(in);
BufferedReader br = new BufferedReader(is)) {
br.lines().limit(20).forEach((l) ->
lines.add(replaceChangingParts(getOutputDir().toUri().getPath().toString(),
l)));
}
lines.add("~~~");
}
}
}
private List<String> assertDirStateDiff(List<String> previousDirState,
List<String> diffList) throws IOException {
List<String> currentDirState = getOutputDirContent();
Assertions.assertThat(unifiedDiff(previousDirState, currentDirState))
.isEqualTo(diffList);
return currentDirState;
}
private List<String> getOutputDirContent() throws IOException {
List<String> lines = new ArrayList<>();
getOutputDirContent(lines, getOutputDir());
return lines;
}
```
For the unified diff I am using a new dependency (it has Apache-2.0 License):
```
<dependency>
<groupId>io.github.java-diff-utils</groupId>
<artifactId>java-diff-utils</artifactId>
<version>4.9</version>
</dependency>
```
I think one of the advantage is `getOutputDirContent` is always a full scan
and if something need to be removed it will be in contained in the unified diff
and never be left out.
Do you see any value in this idea? Is not this too much restrictions on this
code?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]