Copilot commented on code in PR #2601:
URL: https://github.com/apache/orc/pull/2601#discussion_r3086947980
##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -100,11 +132,77 @@ public static void main(Configuration conf, String[]
args) throws Exception {
}
}
+ /**
+ * Multi-output behavior when --maxSize is set.
+ * Input files are grouped by cumulative raw file size; each group is merged
into
+ * a separate part file (part-00000.orc, part-00001.orc, ...) under
outputDir.
+ * A single file whose size already exceeds maxSizeBytes is placed in its
own part.
+ */
+ private static void mergeIntoMultipleFiles(Configuration conf,
+ OrcFile.WriterOptions
writerOptions,
+ List<LocatedFileStatus>
inputStatuses,
+ List<Path> inputFiles,
+ Path outputDir,
+ long maxSizeBytes) throws
Exception {
+ FileSystem outFs = outputDir.getFileSystem(conf);
+ outFs.mkdirs(outputDir);
+
+ // Group input files into batches where each batch's total size <=
maxSizeBytes.
+ List<List<Path>> batches = new ArrayList<>();
+ List<Path> currentBatch = new ArrayList<>();
+ long currentBatchSize = 0;
+
+ for (LocatedFileStatus status : inputStatuses) {
+ long fileSize = status.getLen();
+ if (!currentBatch.isEmpty() && currentBatchSize + fileSize >
maxSizeBytes) {
+ batches.add(currentBatch);
+ currentBatch = new ArrayList<>();
+ currentBatchSize = 0;
+ }
+ currentBatch.add(status.getPath());
+ currentBatchSize += fileSize;
+ }
+ if (!currentBatch.isEmpty()) {
+ batches.add(currentBatch);
+ }
+
+ int totalMerged = 0;
+ List<Path> allUnmerged = new ArrayList<>();
+
+ for (int i = 0; i < batches.size(); i++) {
+ List<Path> batch = batches.get(i);
+ Path partOutput = new Path(outputDir, String.format(PART_FILE_FORMAT,
i));
+ List<Path> merged = OrcFile.mergeFiles(partOutput,
OrcFile.writerOptions(conf), batch);
Review Comment:
`mergeIntoMultipleFiles` takes `writerOptions` but ignores it and instead
creates new `OrcFile.writerOptions(conf)` for each batch. This is confusing and
can also drop caller-specified options (eg overwrite/key provider) if they are
added later. Consider either removing the unused parameter, or (preferably)
cloning the passed `writerOptions` for each batch (WriterOptions supports
`clone()`) and passing the clone into `OrcFile.mergeFiles`.
```suggestion
OrcFile.WriterOptions batchWriterOptions = writerOptions.clone();
List<Path> merged = OrcFile.mergeFiles(partOutput, batchWriterOptions,
batch);
```
##########
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java:
##########
@@ -107,4 +108,82 @@ public void testMerge() throws Exception {
assertEquals(10000 + 20000, reader.getNumberOfRows());
}
}
+
+ /**
+ * Verifies that --maxSize splits input files into multiple part files under
the output
+ * directory. Three source files are created; a tight size threshold forces
them to be
+ * written into at least two part files.
+ */
+ @Test
+ public void testMergeWithMaxSize() throws Exception {
+ TypeDescription schema =
TypeDescription.fromString("struct<x:int,y:string>");
+
+ // Create 3 source ORC files with different row counts.
+ String[] sourceNames = {
+ workDir + File.separator + "ms-1.orc",
+ workDir + File.separator + "ms-2.orc",
+ workDir + File.separator + "ms-3.orc"
+ };
+ int[] rowCounts = {5000, 5000, 5000};
+ for (int f = 0; f < sourceNames.length; f++) {
+ Writer writer = OrcFile.createWriter(new Path(sourceNames[f]),
+ OrcFile.writerOptions(conf).setSchema(schema));
+ VectorizedRowBatch batch = schema.createRowBatch();
+ LongColumnVector x = (LongColumnVector) batch.cols[0];
+ BytesColumnVector y = (BytesColumnVector) batch.cols[1];
+ for (int r = 0; r < rowCounts[f]; ++r) {
+ int row = batch.size++;
+ x.vector[row] = r;
+ byte[] buffer = ("val-" + r).getBytes();
+ y.setRef(row, buffer, 0, buffer.length);
+ if (batch.size == batch.getMaxSize()) {
+ writer.addRowBatch(batch);
+ batch.reset();
+ }
+ }
+ if (batch.size != 0) {
+ writer.addRowBatch(batch);
+ }
+ writer.close();
+ }
+
+ // Measure the size of the first source file to compute a threshold that
forces a split.
+ long singleFileSize = fs.getFileStatus(new Path(sourceNames[0])).getLen();
+ // Threshold: slightly larger than one file so at most one file fits per
part.
+ long maxSize = singleFileSize + 1;
+
+ Path outputDir = new Path(workDir + File.separator + "merge-multi-out");
+ fs.delete(outputDir, true);
+
+ PrintStream origOut = System.out;
+ ByteArrayOutputStream myOut = new ByteArrayOutputStream();
+ System.setOut(new PrintStream(myOut, false, StandardCharsets.UTF_8));
+ MergeFiles.main(conf, new String[]{workDir.toString(),
+ "--output", outputDir.toString(),
+ "--maxSize", String.valueOf(maxSize)});
+ System.out.flush();
+ System.setOut(origOut);
Review Comment:
`System.setOut(...)` is not restored in a `finally` block. If
`MergeFiles.main` throws, stdout will remain redirected and may break
subsequent tests. Wrap the stdout redirection/restoration in `try/finally` (or
use a helper/extension) so it is always restored.
```suggestion
try {
MergeFiles.main(conf, new String[]{workDir.toString(),
"--output", outputDir.toString(),
"--maxSize", String.valueOf(maxSize)});
System.out.flush();
} finally {
System.setOut(origOut);
}
```
##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -100,11 +132,77 @@ public static void main(Configuration conf, String[]
args) throws Exception {
}
}
+ /**
+ * Multi-output behavior when --maxSize is set.
+ * Input files are grouped by cumulative raw file size; each group is merged
into
+ * a separate part file (part-00000.orc, part-00001.orc, ...) under
outputDir.
+ * A single file whose size already exceeds maxSizeBytes is placed in its
own part.
+ */
+ private static void mergeIntoMultipleFiles(Configuration conf,
+ OrcFile.WriterOptions
writerOptions,
+ List<LocatedFileStatus>
inputStatuses,
+ List<Path> inputFiles,
+ Path outputDir,
+ long maxSizeBytes) throws
Exception {
+ FileSystem outFs = outputDir.getFileSystem(conf);
+ outFs.mkdirs(outputDir);
Review Comment:
In multi-file mode the code calls `outFs.mkdirs(outputDir)` without
validating what already exists at `outputDir`. If `--output` points to an
existing file (or an existing non-empty directory), later writes may fail in
confusing ways or partially succeed. Consider checking for
`outFs.exists(outputDir)` / `outFs.isFile(outputDir)` and either failing fast
with a clear message or requiring an empty directory.
```suggestion
if (outFs.exists(outputDir)) {
if (!outFs.getFileStatus(outputDir).isDirectory()) {
throw new IllegalArgumentException(
"Output path already exists and is not a directory: " +
outputDir);
}
if (outFs.listStatus(outputDir).length > 0) {
throw new IllegalArgumentException(
"Output directory must be empty for multi-file merge: " +
outputDir);
}
} else if (!outFs.mkdirs(outputDir)) {
throw new IllegalStateException("Failed to create output directory: "
+ outputDir);
}
```
##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -56,27 +60,55 @@ public static void main(Configuration conf, String[] args)
throws Exception {
}
boolean ignoreExtension = cli.hasOption("ignoreExtension");
- List<Path> inputFiles = new ArrayList<>();
- OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(conf);
+ long maxSizeBytes = 0;
+ if (cli.hasOption("maxSize")) {
+ maxSizeBytes = Long.parseLong(cli.getOptionValue("maxSize"));
+ if (maxSizeBytes <= 0) {
+ System.err.println("--maxSize must be a positive number of bytes.");
+ System.exit(1);
+ }
+ }
Review Comment:
`Long.parseLong(cli.getOptionValue("maxSize"))` will throw
`NumberFormatException` for non-numeric inputs and currently results in an
uncaught exception/stack trace. Consider catching the parse error, printing a
clear message (and help text if appropriate), and exiting with a non-zero
status.
##########
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java:
##########
@@ -107,4 +108,82 @@ public void testMerge() throws Exception {
assertEquals(10000 + 20000, reader.getNumberOfRows());
}
}
+
+ /**
+ * Verifies that --maxSize splits input files into multiple part files under
the output
+ * directory. Three source files are created; a tight size threshold forces
them to be
+ * written into at least two part files.
+ */
+ @Test
+ public void testMergeWithMaxSize() throws Exception {
+ TypeDescription schema =
TypeDescription.fromString("struct<x:int,y:string>");
+
+ // Create 3 source ORC files with different row counts.
+ String[] sourceNames = {
+ workDir + File.separator + "ms-1.orc",
+ workDir + File.separator + "ms-2.orc",
+ workDir + File.separator + "ms-3.orc"
+ };
+ int[] rowCounts = {5000, 5000, 5000};
+ for (int f = 0; f < sourceNames.length; f++) {
Review Comment:
The comment says the 3 source ORC files have "different row counts", but
`rowCounts` is `{5000, 5000, 5000}`. Either update the comment or vary the row
counts so the test matches its description.
##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -113,6 +211,15 @@ private static Options createOptions() {
.desc("Ignore ORC file extension")
.build());
+ result.addOption(Option.builder("m")
+ .longOpt("maxSize")
+ .desc("Maximum size in bytes for each output ORC file. When set,
--output is treated as "
+ + "an output directory and merged files are written as
part-00000.orc, "
+ + "part-00001.orc, etc. Files are grouped at file boundaries so an
individual "
+ + "file larger than this threshold will still be placed in its own
part.")
+ .hasArg()
+ .build());
Review Comment:
`--maxSize` is enforced using the cumulative *input* file sizes
(`LocatedFileStatus#getLen()`), not the actual size of the generated output ORC
part files. The option description currently reads as if it caps the output
file size; consider rewording to avoid implying a strict output-size limit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]