Re: [PR] ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. [orc]

via GitHub Wed, 15 Apr 2026 06:49:34 -0700


Copilot commented on code in PR #2601:
URL: https://github.com/apache/orc/pull/2601#discussion_r3086947980



##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -100,11 +132,77 @@ public static void main(Configuration conf, String[] 
args) throws Exception {
     }
   }
 
+  /**
+   * Multi-output behavior when --maxSize is set.
+   * Input files are grouped by cumulative raw file size; each group is merged 
into
+   * a separate part file (part-00000.orc, part-00001.orc, ...) under 
outputDir.
+   * A single file whose size already exceeds maxSizeBytes is placed in its 
own part.
+   */
+  private static void mergeIntoMultipleFiles(Configuration conf,
+                                              OrcFile.WriterOptions 
writerOptions,
+                                              List<LocatedFileStatus> 
inputStatuses,
+                                              List<Path> inputFiles,
+                                              Path outputDir,
+                                              long maxSizeBytes) throws 
Exception {
+    FileSystem outFs = outputDir.getFileSystem(conf);
+    outFs.mkdirs(outputDir);
+
+    // Group input files into batches where each batch's total size <= 
maxSizeBytes.
+    List<List<Path>> batches = new ArrayList<>();
+    List<Path> currentBatch = new ArrayList<>();
+    long currentBatchSize = 0;
+
+    for (LocatedFileStatus status : inputStatuses) {
+      long fileSize = status.getLen();
+      if (!currentBatch.isEmpty() && currentBatchSize + fileSize > 
maxSizeBytes) {
+        batches.add(currentBatch);
+        currentBatch = new ArrayList<>();
+        currentBatchSize = 0;
+      }
+      currentBatch.add(status.getPath());
+      currentBatchSize += fileSize;
+    }
+    if (!currentBatch.isEmpty()) {
+      batches.add(currentBatch);
+    }
+
+    int totalMerged = 0;
+    List<Path> allUnmerged = new ArrayList<>();
+
+    for (int i = 0; i < batches.size(); i++) {
+      List<Path> batch = batches.get(i);
+      Path partOutput = new Path(outputDir, String.format(PART_FILE_FORMAT, 
i));
+      List<Path> merged = OrcFile.mergeFiles(partOutput, 
OrcFile.writerOptions(conf), batch);

Review Comment:
   `mergeIntoMultipleFiles` takes `writerOptions` but ignores it and instead 
creates new `OrcFile.writerOptions(conf)` for each batch. This is confusing and 
can also drop caller-specified options (eg overwrite/key provider) if they are 
added later. Consider either removing the unused parameter, or (preferably) 
cloning the passed `writerOptions` for each batch (WriterOptions supports 
`clone()`) and passing the clone into `OrcFile.mergeFiles`.
   ```suggestion
         OrcFile.WriterOptions batchWriterOptions = writerOptions.clone();
         List<Path> merged = OrcFile.mergeFiles(partOutput, batchWriterOptions, 
batch);
   ```



##########
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java:
##########
@@ -107,4 +108,82 @@ public void testMerge() throws Exception {
       assertEquals(10000 + 20000, reader.getNumberOfRows());
     }
   }
+
+  /**
+   * Verifies that --maxSize splits input files into multiple part files under 
the output
+   * directory. Three source files are created; a tight size threshold forces 
them to be
+   * written into at least two part files.
+   */
+  @Test
+  public void testMergeWithMaxSize() throws Exception {
+    TypeDescription schema = 
TypeDescription.fromString("struct<x:int,y:string>");
+
+    // Create 3 source ORC files with different row counts.
+    String[] sourceNames = {
+        workDir + File.separator + "ms-1.orc",
+        workDir + File.separator + "ms-2.orc",
+        workDir + File.separator + "ms-3.orc"
+    };
+    int[] rowCounts = {5000, 5000, 5000};
+    for (int f = 0; f < sourceNames.length; f++) {
+      Writer writer = OrcFile.createWriter(new Path(sourceNames[f]),
+          OrcFile.writerOptions(conf).setSchema(schema));
+      VectorizedRowBatch batch = schema.createRowBatch();
+      LongColumnVector x = (LongColumnVector) batch.cols[0];
+      BytesColumnVector y = (BytesColumnVector) batch.cols[1];
+      for (int r = 0; r < rowCounts[f]; ++r) {
+        int row = batch.size++;
+        x.vector[row] = r;
+        byte[] buffer = ("val-" + r).getBytes();
+        y.setRef(row, buffer, 0, buffer.length);
+        if (batch.size == batch.getMaxSize()) {
+          writer.addRowBatch(batch);
+          batch.reset();
+        }
+      }
+      if (batch.size != 0) {
+        writer.addRowBatch(batch);
+      }
+      writer.close();
+    }
+
+    // Measure the size of the first source file to compute a threshold that 
forces a split.
+    long singleFileSize = fs.getFileStatus(new Path(sourceNames[0])).getLen();
+    // Threshold: slightly larger than one file so at most one file fits per 
part.
+    long maxSize = singleFileSize + 1;
+
+    Path outputDir = new Path(workDir + File.separator + "merge-multi-out");
+    fs.delete(outputDir, true);
+
+    PrintStream origOut = System.out;
+    ByteArrayOutputStream myOut = new ByteArrayOutputStream();
+    System.setOut(new PrintStream(myOut, false, StandardCharsets.UTF_8));
+    MergeFiles.main(conf, new String[]{workDir.toString(),
+        "--output", outputDir.toString(),
+        "--maxSize", String.valueOf(maxSize)});
+    System.out.flush();
+    System.setOut(origOut);

Review Comment:
   `System.setOut(...)` is not restored in a `finally` block. If 
`MergeFiles.main` throws, stdout will remain redirected and may break 
subsequent tests. Wrap the stdout redirection/restoration in `try/finally` (or 
use a helper/extension) so it is always restored.
   ```suggestion
       try {
         MergeFiles.main(conf, new String[]{workDir.toString(),
             "--output", outputDir.toString(),
             "--maxSize", String.valueOf(maxSize)});
         System.out.flush();
       } finally {
         System.setOut(origOut);
       }
   ```



##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -100,11 +132,77 @@ public static void main(Configuration conf, String[] 
args) throws Exception {
     }
   }
 
+  /**
+   * Multi-output behavior when --maxSize is set.
+   * Input files are grouped by cumulative raw file size; each group is merged 
into
+   * a separate part file (part-00000.orc, part-00001.orc, ...) under 
outputDir.
+   * A single file whose size already exceeds maxSizeBytes is placed in its 
own part.
+   */
+  private static void mergeIntoMultipleFiles(Configuration conf,
+                                              OrcFile.WriterOptions 
writerOptions,
+                                              List<LocatedFileStatus> 
inputStatuses,
+                                              List<Path> inputFiles,
+                                              Path outputDir,
+                                              long maxSizeBytes) throws 
Exception {
+    FileSystem outFs = outputDir.getFileSystem(conf);
+    outFs.mkdirs(outputDir);

Review Comment:
   In multi-file mode the code calls `outFs.mkdirs(outputDir)` without 
validating what already exists at `outputDir`. If `--output` points to an 
existing file (or an existing non-empty directory), later writes may fail in 
confusing ways or partially succeed. Consider checking for 
`outFs.exists(outputDir)` / `outFs.isFile(outputDir)` and either failing fast 
with a clear message or requiring an empty directory.
   ```suggestion
       if (outFs.exists(outputDir)) {
         if (!outFs.getFileStatus(outputDir).isDirectory()) {
           throw new IllegalArgumentException(
               "Output path already exists and is not a directory: " + 
outputDir);
         }
         if (outFs.listStatus(outputDir).length > 0) {
           throw new IllegalArgumentException(
               "Output directory must be empty for multi-file merge: " + 
outputDir);
         }
       } else if (!outFs.mkdirs(outputDir)) {
         throw new IllegalStateException("Failed to create output directory: " 
+ outputDir);
       }
   ```



##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -56,27 +60,55 @@ public static void main(Configuration conf, String[] args) 
throws Exception {
     }
     boolean ignoreExtension = cli.hasOption("ignoreExtension");
 
-    List<Path> inputFiles = new ArrayList<>();
-    OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(conf);
+    long maxSizeBytes = 0;
+    if (cli.hasOption("maxSize")) {
+      maxSizeBytes = Long.parseLong(cli.getOptionValue("maxSize"));
+      if (maxSizeBytes <= 0) {
+        System.err.println("--maxSize must be a positive number of bytes.");
+        System.exit(1);
+      }
+    }

Review Comment:
   `Long.parseLong(cli.getOptionValue("maxSize"))` will throw 
`NumberFormatException` for non-numeric inputs and currently results in an 
uncaught exception/stack trace. Consider catching the parse error, printing a 
clear message (and help text if appropriate), and exiting with a non-zero 
status.



##########
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java:
##########
@@ -107,4 +108,82 @@ public void testMerge() throws Exception {
       assertEquals(10000 + 20000, reader.getNumberOfRows());
     }
   }
+
+  /**
+   * Verifies that --maxSize splits input files into multiple part files under 
the output
+   * directory. Three source files are created; a tight size threshold forces 
them to be
+   * written into at least two part files.
+   */
+  @Test
+  public void testMergeWithMaxSize() throws Exception {
+    TypeDescription schema = 
TypeDescription.fromString("struct<x:int,y:string>");
+
+    // Create 3 source ORC files with different row counts.
+    String[] sourceNames = {
+        workDir + File.separator + "ms-1.orc",
+        workDir + File.separator + "ms-2.orc",
+        workDir + File.separator + "ms-3.orc"
+    };
+    int[] rowCounts = {5000, 5000, 5000};
+    for (int f = 0; f < sourceNames.length; f++) {

Review Comment:
   The comment says the 3 source ORC files have "different row counts", but 
`rowCounts` is `{5000, 5000, 5000}`. Either update the comment or vary the row 
counts so the test matches its description.



##########
java/tools/src/java/org/apache/orc/tools/MergeFiles.java:
##########
@@ -113,6 +211,15 @@ private static Options createOptions() {
         .desc("Ignore ORC file extension")
         .build());
 
+    result.addOption(Option.builder("m")
+        .longOpt("maxSize")
+        .desc("Maximum size in bytes for each output ORC file. When set, 
--output is treated as "
+            + "an output directory and merged files are written as 
part-00000.orc, "
+            + "part-00001.orc, etc. Files are grouped at file boundaries so an 
individual "
+            + "file larger than this threshold will still be placed in its own 
part.")
+        .hasArg()
+        .build());

Review Comment:
   `--maxSize` is enforced using the cumulative *input* file sizes 
(`LocatedFileStatus#getLen()`), not the actual size of the generated output ORC 
part files. The option description currently reads as if it caps the output 
file size; consider rewording to avoid implying a strict output-size limit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. [orc]

Reply via email to