pvary commented on a change in pull request #3748:
URL: https://github.com/apache/iceberg/pull/3748#discussion_r769569090



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -777,6 +779,42 @@ public void testStatsPopulation() throws Exception {
     Assert.assertTrue(stats.startsWith("{\"BASIC_STATS\":\"true\"")); // it's followed by column stats in Hive3
   }
 
+  /**
+   * Tests that vectorized ORC reading code path correctly handles when the same ORC file is split into multiple parts.
+   * Although the split offsets and length will not always include the file tail that contains the metadata, the
+   * vectorized reader needs to make sure to handle the tail reading regardless of the offsets. If this is not done
+   * correctly, the last SELECT query will fail.
+   * @throws Exception - any test error
+   */
+  @Test
+  public void testVectorizedOrcMultipleSplits() throws Exception {
+    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
+
+    try {
+      // This data will be held by a ~870kB ORC file
+      List<Record> records = TestHelper.generateRandomRecords(HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+          20000, 0L);
+
+      // To support splitting the ORC file, we need to specify the stripe size to a small value. It looks like the min
+      // value is about 220kB, no smaller stripes are written by ORC. Anyway, this setting will produce 4 stripes.
+      shell.getHiveConf().set("orc.stripe.size", "200000");
+
+      testTables.createTable(shell, "targettab", HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+          fileFormat, records);
+
+      // Will request 4 splits, separated on the exact stripe boundaries within the ORC file.
+      // (Would request 5 if ORC split generation wouldn't be split (aka stripe) offset aware).
+      shell.getHiveConf().set(InputFormatConfig.SPLIT_SIZE, "200000");

Review comment:
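       Maybe use shell.setHiveSessionValue(InputFormatConfig.SPLIT_SIZE, "200000") here instead of setting the value through shell.getHiveConf()?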
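
       For reference, the "4 splits vs. 5" arithmetic in the comment above can be checked with a small standalone sketch. This is a simplified model and not Iceberg's actual split planner: the class name, method names, and stripe offsets are hypothetical, chosen only to match the ~870kB file and ~220kB stripes described in the test.

```java
public class SplitCountSketch {

  // Offset-unaware planning: cut the file every splitSize bytes.
  public static int fixedSizeSplits(long fileLength, long splitSize) {
    return (int) ((fileLength + splitSize - 1) / splitSize); // ceiling division
  }

  // Offset-aware planning (simplified): a split may only start at a stripe
  // offset; consecutive stripes are packed greedily while they still fit
  // into splitSize. The last stripe is assumed to run to the end of the file.
  public static int offsetAwareSplits(long[] stripeOffsets, long fileLength, long splitSize) {
    int splits = 0;
    long current = 0; // bytes already assigned to the currently open split
    for (int i = 0; i < stripeOffsets.length; i++) {
      long end = (i + 1 < stripeOffsets.length) ? stripeOffsets[i + 1] : fileLength;
      long stripeLength = end - stripeOffsets[i];
      if (current == 0 || current + stripeLength > splitSize) {
        splits++; // stripe does not fit, open a new split
        current = stripeLength;
      } else {
        current += stripeLength;
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    long fileLength = 870_000L; // "~870kB ORC file" from the test comment
    long splitSize = 200_000L;  // value set for InputFormatConfig.SPLIT_SIZE
    // Hypothetical stripe offsets, assuming four stripes of roughly 220kB each.
    long[] stripeOffsets = {3L, 220_000L, 440_000L, 660_000L};

    System.out.println(fixedSizeSplits(fileLength, splitSize));
    System.out.println(offsetAwareSplits(stripeOffsets, fileLength, splitSize));
  }
}
```

       Since every assumed stripe is larger than the 200000-byte split size, no two stripes can share a split, so the offset-aware count equals the stripe count.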




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


