[ 
https://issues.apache.org/jira/browse/HIVE-23597?focusedWorklogId=452175&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-452175
 ]

ASF GitHub Bot logged work on HIVE-23597:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Jun/20 05:43
            Start Date: 29/Jun/20 05:43
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #1081:
URL: https://github.com/apache/hive/pull/1081#discussion_r446784234



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1605,6 +1618,46 @@ public int compareTo(CompressedOwid other) {
         throw e; // rethrow the exception so that the caller can handle.
       }
     }
+
+    /**
+     * Create delete delta reader. Caching orc tail to avoid FS lookup/reads for repeated scans.
+     *
+     * @param deleteDeltaFile
+     * @param conf
+     * @param fs FileSystem
+     * @return delete file reader
+     * @throws IOException
+     */
+    private Reader getDeleteDeltaReader(Path deleteDeltaFile, JobConf conf, FileSystem fs) throws IOException {
+      OrcTail deleteDeltaTail = deleteDeltaOrcTailCache.getIfPresent(deleteDeltaFile);

Review comment:
       Is the OrcTail thread safe? If I understand correctly, the tail will be 
read by multiple LLAP threads concurrently.
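
To make the concern concrete, here is a minimal sketch of one pattern that is safe to share across threads: an immutable cached value behind a concurrent map. It is not the code from the pull request. `TailCacheSketch` and `Tail` are hypothetical names, `Tail` is only a stand-in for `OrcTail`, and the JDK's `ConcurrentHashMap` is used in place of whatever cache backs `deleteDeltaOrcTailCache`. Whether the real `OrcTail` can be shared this way depends on whether it exposes mutable state, which the sketch does not answer.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class TailCacheSketch {
    /** Hypothetical stand-in for a parsed file tail. All fields are final, so instances are safe to share. */
    static final class Tail {
        final String path;
        final int footerLength;

        Tail(String path, int footerLength) {
            this.path = path;
            this.footerLength = footerLength;
        }
    }

    private final ConcurrentHashMap<String, Tail> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    /** Returns the cached tail, loading it at most once per path even under contention. */
    Tail get(String path) {
        return cache.computeIfAbsent(path, p -> {
            loads.incrementAndGet();          // counts simulated file system reads
            return new Tail(p, p.length());   // placeholder for the real tail parsing
        });
    }

    int loadCount() {
        return loads.get();
    }

    /** Eight threads request the same path; returns how many loads actually happened. */
    static int demo() {
        TailCacheSketch sketch = new TailCacheSketch();
        Thread[] readers = new Thread[8];
        for (int i = 0; i < readers.length; i++) {
            readers[i] = new Thread(() -> sketch.get("delete_delta_0000003_0000003_0000"));
            readers[i].start();
        }
        try {
            for (Thread reader : readers) {
                reader.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.loadCount();
    }

    public static void main(String[] args) {
        System.out.println("loads = " + demo());
    }
}
```

`computeIfAbsent` applies the loader at most once per key, so concurrent readers of the same path share a single `Tail`. A `getIfPresent` followed by a separate put, as in the diff, may load the same tail more than once under contention. That is harmless as long as the cached object is never mutated after publication.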




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 452175)
    Time Spent: 1h 10m  (was: 1h)

> VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete 
> delta directories multiple times
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23597
>                 URL: https://issues.apache.org/jira/browse/HIVE-23597
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]
> {code:java}
> try {
>         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
>         if (deleteDeltaDirs.length > 0) {
>           int totalDeleteEventCount = 0;
>           for (Path deleteDeltaDir : deleteDeltaDirs) {
> {code}
>  
> Consider a directory layout like the following. It was created by running a 
> simple set of "insert --> update --> select" queries.
>  
> {noformat}
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000
>  {noformat}
>  
> The OrcSplit contains all the delete delta folder information. For a directory 
> layout like this, it would create {{~12 splits}}. For every split, it 
> constructs a "ColumnizedDeleteEventRegistry" in VRBAcidReader and ends up 
> reading all these delete delta folders again.
>  In this case, it would read them approximately {{121 times!}}.
> This causes a huge delay in running simple queries like "{{select * from 
> tab_x}}" on cloud storage. 
>  
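
The figure of roughly 121 reads can be reproduced with simple arithmetic. The listing above shows 11 delete_delta directories, and if about 11 of the splits each open every one of them, that gives 11 x 11 = 121 opens. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes one file per delete delta directory and a cache that never evicts.

```java
public class DeleteDeltaReadCount {
    /** Every split opens every delete delta directory, so the reads multiply. */
    static int tailReadsWithoutCache(int splits, int deleteDeltaDirs) {
        return splits * deleteDeltaDirs;
    }

    /** With a shared tail cache (assumed never to evict), each tail is read once. */
    static int tailReadsWithCache(int deleteDeltaDirs) {
        return deleteDeltaDirs;
    }

    public static void main(String[] args) {
        int splits = 11;          // assumption: the description says "~12 splits"
        int deleteDeltaDirs = 11; // delete_delta_0000003 .. delete_delta_0000013 in the listing
        System.out.println("without cache: " + tailReadsWithoutCache(splits, deleteDeltaDirs));
        System.out.println("with cache:    " + tailReadsWithCache(deleteDeltaDirs));
    }
}
```

Caching only the tail removes the repeated footer lookups. The delete events themselves would still be read once per split unless they are cached separately.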



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
