qphien commented on a change in pull request #2052:
URL: https://github.com/apache/iceberg/pull/2052#discussion_r554041892
##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -66,9 +67,7 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
assertNotVectorizedTez(configuration);
Schema tableSchema;
- if (configuration.get(InputFormatConfig.TABLE_SCHEMA) != null) {
- tableSchema = SchemaParser.fromJson(configuration.get(InputFormatConfig.TABLE_SCHEMA));
- } else if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
+ if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
Review comment:
As shown below:
* The mapper handles input splits that belong to a single table:
https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L406
* The function **copyTableJobPropertiesToConf** copies the 'iceberg.mr.table.schema' property to the jobConf:
https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2427-L2443

If we join two tables, only one Iceberg schema exists in the jobConf, which leads to a wrong inspector (see the sketch after the examples below):
* an empty inspector when the two tables' selected columns do not overlap, e.g.
`SELECT o.order_id, o.customer_id, o.total, p.name FROM default.orders o JOIN default.products p ON o.product_id = p.id ORDER BY o.order_id`
selected columns in table default.orders: [order_id, total, customer_id, product_id]
selected columns in table default.products: [id, name]
* an incomplete inspector when the two tables' selected columns overlap, e.g.
`SELECT c.first_name, o.order_id FROM default.orders o JOIN default.customers c ON o.customer_id = c.customer_id ORDER BY o.order_id DESC`
selected columns in table default.customers: [first_name, customer_id]
selected columns in table default.orders: [order_id, customer_id]
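A minimal sketch of the idea, not the actual HiveIcebergSerDe code: each table's SerDe properties carry that table's own schema, while the jobConf can hold only one `iceberg.mr.table.schema` value, so in a multi-table query the per-table SerDe properties have to win. The helper name `resolveTableSchema` is hypothetical.

```java
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.mr.InputFormatConfig;

class SchemaResolutionSketch {
  // Prefer the per-table SerDe properties; fall back to the job-level value,
  // which is only unambiguous for single-table queries.
  static Schema resolveTableSchema(Configuration conf, Properties serDeProperties) {
    String fromSerDe = serDeProperties.getProperty(InputFormatConfig.TABLE_SCHEMA);
    if (fromSerDe != null) {
      // Set per table by the storage handler, so it stays correct even when
      // several Iceberg tables take part in the same join query.
      return SchemaParser.fromJson(fromSerDe);
    }
    String fromConf = conf.get(InputFormatConfig.TABLE_SCHEMA);
    return fromConf != null ? SchemaParser.fromJson(fromConf) : null;
  }
}
```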
##########
File path:
mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -42,24 +42,41 @@
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
+import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;
import static org.junit.runners.Parameterized.Parameter;
import static org.junit.runners.Parameterized.Parameters;
@RunWith(Parameterized.class)
public class TestHiveIcebergStorageHandlerWithEngine {
- private static final String[] EXECUTION_ENGINES = new String[] {"tez", "mr"};
+ private static final String[] EXECUTION_ENGINES = new String[]{"tez", "mr"};
Review comment:
Sorry, it's my fault, my IDEA automatically reformatted the code. I will pay more attention next time.
##########
File path:
mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -42,24 +42,41 @@
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
+import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;
import static org.junit.runners.Parameterized.Parameter;
import static org.junit.runners.Parameterized.Parameters;
@RunWith(Parameterized.class)
public class TestHiveIcebergStorageHandlerWithEngine {
- private static final String[] EXECUTION_ENGINES = new String[] {"tez", "mr"};
+ private static final String[] EXECUTION_ENGINES = new String[]{"tez", "mr"};
+
+ private static final String[] CBO_ENABLES = new String[]{"true", "false"};
Review comment:
Okay, I will move the CBO_ENABLES list into the `parameters` method; a sketch of what that could look like is below.
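A minimal sketch of crossing the engine and CBO flags inside the `parameters` method, assuming JUnit 4's Parameterized runner as in the existing test; the class and field names here are placeholders, not the actual test code, and the existing fileFormat dimension is omitted.

```java
import java.util.ArrayList;
import java.util.Collection;

import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameter;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class EngineCboParametersSketch {

  @Parameters(name = "engine={0}, cbo={1}")
  public static Collection<Object[]> parameters() {
    Collection<Object[]> testParams = new ArrayList<>();
    // The cross-product of execution engines and CBO settings replaces the
    // separate CBO_ENABLES constant.
    for (String engine : new String[] {"tez", "mr"}) {
      for (String cboEnabled : new String[] {"true", "false"}) {
        testParams.add(new Object[] {engine, cboEnabled});
      }
    }
    return testParams;
  }

  @Parameter(0)
  public String executionEngine;

  @Parameter(1)
  public String cboEnabled;
}
```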
##########
File path:
mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -142,29 +164,61 @@ public void testScanTable() throws IOException {
// Adding the ORDER BY clause will cause Hive to spawn a local MR job this time.
List<Object[]> descRows =
- shell.executeStatement("SELECT first_name, customer_id FROM default.customers ORDER BY customer_id DESC");
+ shell.executeStatement("SELECT first_name, customer_id FROM default.customers ORDER BY customer_id DESC");
Assert.assertEquals(3, descRows.size());
- Assert.assertArrayEquals(new Object[] {"Trudy", 2L}, descRows.get(0));
- Assert.assertArrayEquals(new Object[] {"Bob", 1L}, descRows.get(1));
- Assert.assertArrayEquals(new Object[] {"Alice", 0L}, descRows.get(2));
+ Assert.assertArrayEquals(new Object[]{"Trudy", 2L}, descRows.get(0));
+ Assert.assertArrayEquals(new Object[]{"Bob", 1L}, descRows.get(1));
+ Assert.assertArrayEquals(new Object[]{"Alice", 0L}, descRows.get(2));
+ }
+
+ @Test
+ public void testSelectedColumnsNoOverlapJoin() throws IOException {
+ testTables.createTable(shell, "products", PRODUCT_SCHEMA, fileFormat, PRODUCT_RECORDS);
+ testTables.createTable(shell, "orders", ORDER_SCHEMA, fileFormat, ORDER_RECORDS);
+
+ List<Object[]> rows = shell.executeStatement(
+ "SELECT o.order_id, o.customer_id, o.total, p.name " +
+ "FROM default.orders o JOIN default.products p ON o.product_id = p.id ORDER BY o.order_id"
+ );
+
+ Assert.assertEquals(3, rows.size());
+ Assert.assertArrayEquals(new Object[]{100L, 0L, 11.11d, "skirt"}, rows.get(0));
+ Assert.assertArrayEquals(new Object[]{101L, 0L, 22.22d, "tee"}, rows.get(1));
+ Assert.assertArrayEquals(new Object[]{102L, 1L, 33.33d, "watch"}, rows.get(2));
}
@Test
- public void testJoinTables() throws IOException {
+ public void testSelectedColumnsOverlapJoin() throws IOException {
testTables.createTable(shell, "customers",
HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA, fileFormat,
- HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS);
+ HiveIcebergStorageHandlerTestUtils.CUSTOMER_RECORDS);
+ testTables.createTable(shell, "orders", ORDER_SCHEMA, fileFormat,
ORDER_RECORDS);
+
+ List<Object[]> rows = shell.executeStatement(
+ "SELECT c.first_name, o.order_id " +
+ "FROM default.orders o JOIN default.customers c ON
o.customer_id = c.customer_id " +
+ "ORDER BY o.order_id DESC"
+ );
+
+ Assert.assertEquals(3, rows.size());
+ Assert.assertArrayEquals(new Object[]{"Bob", 102L}, rows.get(0));
+ Assert.assertArrayEquals(new Object[]{"Alice", 101L}, rows.get(1));
+ Assert.assertArrayEquals(new Object[]{"Alice", 100L}, rows.get(2));
+ }
+
+ @Test
+ public void testSelfJoin() throws IOException {
testTables.createTable(shell, "orders", ORDER_SCHEMA, fileFormat,
ORDER_RECORDS);
List<Object[]> rows = shell.executeStatement(
- "SELECT c.customer_id, c.first_name, o.order_id, o.total " +
- "FROM default.customers c JOIN default.orders o ON
c.customer_id = o.customer_id " +
- "ORDER BY c.customer_id, o.order_id"
+ "SELECT o1.order_id, o1.customer_id, o1.total " +
+ "FROM default.orders o1 JOIN default.orders o2 ON
o1.order_id = o2.order_id ORDER BY o1.order_id"
Review comment:
As shown in
https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L992-L998,
a self join can lead to duplicated column names in the jobConf property `hive.io.file.readcolumn.names`.
In this case, it is necessary to test whether a self join works correctly; a small illustration follows.
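A minimal sketch of the situation, not the actual Iceberg MR code: with the self join above, Hive can end up writing something like `order_id,customer_id,total,order_id` into `hive.io.file.readcolumn.names`, so the reader side has to tolerate duplicates when it projects the selected columns. The helper below and its sample input are illustrative assumptions only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class ReadColumnNamesSketch {
  // Deduplicate the comma-separated column list while keeping the first-seen order.
  static Set<String> selectedColumns(String readColumnNames) {
    Set<String> distinct = new LinkedHashSet<>();
    for (String column : readColumnNames.split(",")) {
      if (!column.isEmpty()) {
        distinct.add(column);
      }
    }
    return distinct;
  }

  public static void main(String[] args) {
    // Hypothetical value of hive.io.file.readcolumn.names for the self-join query.
    System.out.println(selectedColumns("order_id,customer_id,total,order_id"));
    // -> [order_id, customer_id, total]
  }
}
```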
##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -66,9 +67,7 @@ public void initialize(@Nullable Configuration configuration, Properties serDePr
assertNotVectorizedTez(configuration);
Schema tableSchema;
- if (configuration.get(InputFormatConfig.TABLE_SCHEMA) != null) {
- tableSchema = SchemaParser.fromJson(configuration.get(InputFormatConfig.TABLE_SCHEMA));
- } else if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
+ if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
Review comment:
Hi @marton-bod, is there any ongoing work on the proposal @pvary mentioned above? I cannot find any related issue on GitHub. Thanks.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]