Re: [PR] ORC-2149: Supports merging multiple ORC files with the same schema into multiple ORC files. [orc]

via GitHub Tue, 21 Apr 2026 06:01:33 -0700


cxzl25 commented on code in PR #2601:
URL: https://github.com/apache/orc/pull/2601#discussion_r3117582102



##########
java/tools/src/test/org/apache/orc/tools/TestMergeFiles.java:
##########
@@ -107,4 +108,85 @@ public void testMerge() throws Exception {
       assertEquals(10000 + 20000, reader.getNumberOfRows());
     }
   }
+
+  /**
+   * Verifies that --maxSize splits input files into multiple part files under the output
+   * directory. Three source files are created; a tight size threshold forces them to be
+   * written into at least two part files.
+   */
+  @Test
+  public void testMergeWithMaxSize() throws Exception {
+    TypeDescription schema = TypeDescription.fromString("struct<x:int,y:string>");
+
+    // Create 3 source ORC files.
+    String[] sourceNames = {
+        workDir + File.separator + "ms-1.orc",
+        workDir + File.separator + "ms-2.orc",
+        workDir + File.separator + "ms-3.orc"
+    };
+    int[] rowCounts = {5000, 5000, 5000};
+    for (int f = 0; f < sourceNames.length; f++) {
+      Writer writer = OrcFile.createWriter(new Path(sourceNames[f]),
+          OrcFile.writerOptions(conf).setSchema(schema));
+      VectorizedRowBatch batch = schema.createRowBatch();
+      LongColumnVector x = (LongColumnVector) batch.cols[0];
+      BytesColumnVector y = (BytesColumnVector) batch.cols[1];
+      for (int r = 0; r < rowCounts[f]; ++r) {
+        int row = batch.size++;
+        x.vector[row] = r;
+        byte[] buffer = ("val-" + r).getBytes();
+        y.setRef(row, buffer, 0, buffer.length);
+        if (batch.size == batch.getMaxSize()) {
+          writer.addRowBatch(batch);
+          batch.reset();
+        }
+      }
+      if (batch.size != 0) {
+        writer.addRowBatch(batch);
+      }
+      writer.close();
+    }
+
+    // Measure the size of the first source file to compute a threshold that forces a split.
+    long singleFileSize = fs.getFileStatus(new Path(sourceNames[0])).getLen();
+    // Threshold: slightly larger than one file so at most one file fits per part.
+    long maxSize = singleFileSize + 1;

Review Comment:
   Could the threshold be the combined size of the first two files plus 1, so that
   the test checks that files are really merged into a part?
   
   sourceNames[0] len + sourceNames[1] len + 1
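
To make the difference between the two thresholds concrete, here is a minimal sketch. It assumes a greedy policy that keeps adding files to the current part while the cumulative size stays within maxSize; the actual grouping logic in the PR may differ. The class name, the `groupBySize` helper, and the 1000-byte file sizes are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxSizeGroupingSketch {

  // Hypothetical greedy grouping, NOT the merge tool's real implementation:
  // start a new part whenever adding the next file would exceed maxSize.
  static List<List<Long>> groupBySize(long[] sizes, long maxSize) {
    List<List<Long>> parts = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : sizes) {
      if (!current.isEmpty() && currentSize + size > maxSize) {
        parts.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      parts.add(current);
    }
    return parts;
  }

  public static void main(String[] args) {
    // Made-up sizes standing in for three equally sized source files.
    long[] sizes = {1000L, 1000L, 1000L};

    // Threshold used in the test: one file per part, so nothing is merged.
    System.out.println(groupBySize(sizes, sizes[0] + 1));
    // prints [[1000], [1000], [1000]]

    // Threshold suggested in the review: the first part holds two files.
    System.out.println(groupBySize(sizes, sizes[0] + sizes[1] + 1));
    // prints [[1000, 1000], [1000]]
  }
}
```

Under this assumed policy, the suggested threshold yields one part with two source files and one part with a single file, so the test could assert both the number of part files and the row count of each part (for example 10000 and 5000 rows with the row counts used above).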



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

