[ 
https://issues.apache.org/jira/browse/HIVE-24021?focusedWorklogId=468660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468660
 ]

ASF GitHub Bot logged work on HIVE-24021:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Aug/20 14:42
            Start Date: 10/Aug/20 14:42
    Worklog Time Spent: 10m 
      Work Description: klcopp commented on a change in pull request #1384:
URL: https://github.com/apache/hive/pull/1384#discussion_r467952862



##########
File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommandsForMmTable.java
##########
@@ -481,6 +482,61 @@ public void testOperationsOnCompletedTxnComponentsForMmTable() throws Exception
     verifyDirAndResult(0, true);
   }
 
+  /**
+   * Impala truncates insert-only tables by writing a base directory (like insert overwrite) containing an empty file
+   * named "_empty". Generally in Hive files beginning with an underscore are hidden, so here we make sure that Hive
+   * reads these bases correctly.
+   *
+   * @throws Exception
+   */
+  @Test
+  public void testImpalaTruncatedMmTable() throws Exception {
+    FileSystem fs = FileSystem.get(hiveConf);
+    FileStatus[] status;
+
+    Path tblLocation = new Path(TEST_WAREHOUSE_DIR + "/" +
+        (TableExtended.MMTBL).toString().toLowerCase());
+
+    // 1. Insert two rows to an MM table
+    runStatementOnDriver("insert into " + TableExtended.MMTBL + "(a,b) values(1,2)");
+    runStatementOnDriver("insert into " + TableExtended.MMTBL + "(a,b) values(3,4)");
+    status = fs.listStatus(tblLocation, FileUtils.STAGING_DIR_PATH_FILTER);
+    // There should be 2 delta dirs in the location
+    Assert.assertEquals(2, status.length);
+    for (int i = 0; i < status.length; i++) {
+      Assert.assertTrue(status[i].getPath().getName().matches("delta_.*"));
+    }
+
+    // 2. Simulate Impala truncating the table: write a base dir (base_0000003) containing an empty file.
+    // Hive will name the empty file "000000_0"
+    runStatementOnDriver("insert overwrite  table " + TableExtended.MMTBL + " select * from "
+        + TableExtended.MMTBL + " where 1=2");
+    status = fs.listStatus(tblLocation, FileUtils.STAGING_DIR_PATH_FILTER);
+    // There should be 2 delta dirs, plus 1 base dir in the location
+    Assert.assertEquals(3, status.length);
+    int baseCount = 0;
+    int deltaCount = 0;
+    for (int i = 0; i < status.length; i++) {
+      String dirName = status[i].getPath().getName();
+      if (dirName.matches("delta_.*")) {
+        deltaCount++;
+      } else {
+        baseCount++;
+      }
+    }
+    Assert.assertEquals(2, deltaCount);
+    Assert.assertEquals(1, baseCount);
+
+    // rename empty file to "_empty"
+    Path basePath = new Path(tblLocation, "base_0000003");
+    Assert.assertTrue("Rename failed",
+        fs.rename(new Path(basePath, "000000_0"), new Path(basePath, "_empty")));
+
+    // 3. Verify query result. Selecting from a truncated table should return nothing.
+    List<String> rs = runStatementOnDriver("select a,b from " + TableExtended.MMTBL + " order by a,b");
+    Assert.assertEquals(Collections.emptyList(), rs);
+  }

Review comment:
       Great idea!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 468660)
    Time Spent: 1h 10m  (was: 1h)

> Read insert-only tables truncated by Impala correctly
> -----------------------------------------------------
>
>                 Key: HIVE-24021
>                 URL: https://issues.apache.org/jira/browse/HIVE-24021
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only tables by writing a base directory containing an 
> empty file named "_empty" (as Hive should; see HIVE-20137). Generally in 
> Hive, a file name beginning with an underscore denotes a temporary file that 
> isn't supposed to be read by operations that didn't create it.
>  Before HIVE-23495, getAcidState listed each directory in the table 
> (HdfsUtils#listLocatedStatus) and filtered out directories with names 
> beginning with an underscore or period, as they are presumably temporary. This 
> allowed files called "_empty" to be read, since Hive checked the directory 
> name and not the file name.
>  After HIVE-23495, we recursively list each file in the table 
> (AcidUtils#getHdfsDirSnapshots) with a filter that doesn't accept files with 
> names beginning with an underscore or period, as they are presumably 
> temporary. As a result, the "_empty" file is filtered out and Hive reads the 
> table data as if the truncate operation had not happened.
> Since performance in getAcidState is important, probably the best solution is 
> to make an exception in the filter and accept files with the name "_empty".
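
The proposed exception can be sketched on plain file names, without any Hadoop dependency. This is only an illustration of the idea described above, not the actual change made in Hive: the class name EmptyMarkerFilterSketch, the constant EMPTY_MARKER, and the helper visibleFiles are all hypothetical and do not exist in AcidUtils.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class EmptyMarkerFilterSketch {
  // Hypothetical constant: the marker file name mentioned in the issue description.
  static final String EMPTY_MARKER = "_empty";

  // Reject names starting with '_' or '.', which are presumably temporary,
  // but make a single exception for the "_empty" marker file.
  static final Predicate<String> ACCEPT = name ->
      EMPTY_MARKER.equals(name) || !(name.startsWith("_") || name.startsWith("."));

  // Hypothetical helper: keep only the file names the filter accepts.
  static List<String> visibleFiles(List<String> names) {
    return names.stream().filter(ACCEPT).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> listing = Arrays.asList("000000_0", "_empty", "_tmp.000000_0", ".hive-staging");
    System.out.println(visibleFiles(listing)); // prints [000000_0, _empty]
  }
}
```

In Hive itself the same predicate would presumably be expressed as a Hadoop PathFilter applied to path.getName(), so the check stays a cheap string comparison per listed file.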



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
