marton-bod commented on a change in pull request #3748:
URL: https://github.com/apache/iceberg/pull/3748#discussion_r769577513



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -777,6 +779,42 @@ public void testStatsPopulation() throws Exception {
     Assert.assertTrue(stats.startsWith("{\"BASIC_STATS\":\"true\"")); // it's followed by column stats in Hive3
   }
 
+  /**
+   * Tests that vectorized ORC reading code path correctly handles when the same ORC file is split into multiple parts.
+   * Although the split offsets and length will not always include the file tail that contains the metadata, the
+   * vectorized reader needs to make sure to handle the tail reading regardless of the offsets. If this is not done
+   * correctly, the last SELECT query will fail.
+   * @throws Exception - any test error
+   */
+  @Test
+  public void testVectorizedOrcMultipleSplits() throws Exception {
+    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
+
+    try {
+      // This data will be held by a ~870kB ORC file
+      List<Record> records = TestHelper.generateRandomRecords(HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+          20000, 0L);
+
+      // To support splitting the ORC file, we need to specify the stripe size to a small value. It looks like the min
+      // value is about 220kB, no smaller stripes are written by ORC. Anyway, this setting will produce 4 stripes.
+      shell.getHiveConf().set("orc.stripe.size", "200000");

Review comment:
       nit: would it work just as well with a slightly different value, like 
190k or 210k? It was hard for me to eyeball the difference between this 200k 
value and the 20k value above when generating the records. Made me think for a 
moment that the values are related somehow :)
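
A minimal sketch of another way to keep the two numbers apart without changing either value: give each a name and use digit separators. The class name, constant names, and the plain `Map` standing in for the Hive configuration are all assumptions made for illustration, not code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StripeSizeSketch {
  // Hypothetical constant names; the values are the ones used in the test under review.
  static final int RECORD_COUNT = 20_000;
  static final int ORC_STRIPE_SIZE_BYTES = 200_000;

  // A plain map stands in for the Hive configuration (assumption) so the sketch runs without Hive.
  static Map<String, String> buildConf() {
    Map<String, String> conf = new LinkedHashMap<>();
    // The digit separator is source-level only; the stored string has no underscore.
    conf.put("orc.stripe.size", String.valueOf(ORC_STRIPE_SIZE_BYTES));
    return conf;
  }

  public static void main(String[] args) {
    System.out.println("records=" + RECORD_COUNT + ", conf=" + buildConf());
  }
}
```

With names attached, a reader no longer has to count zeros to see that the record count and the stripe size are unrelated settings.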




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


