[ https://issues.apache.org/jira/browse/HIVE-23597?focusedWorklogId=442855&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-442855 ]
ASF GitHub Bot logged work on HIVE-23597:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 09/Jun/20 16:05
Start Date: 09/Jun/20 16:05
Worklog Time Spent: 10m

Work Description: pvary commented on a change in pull request #1081:
URL: https://github.com/apache/hive/pull/1081#discussion_r437191512


##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1561,24 +1572,22 @@ public int compareTo(CompressedOwid other) {
     try {
       final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
       if (deleteDeltaDirs.length > 0) {
+        FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+        AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+            AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
         int totalDeleteEventCount = 0;
         for (Path deleteDeltaDir : deleteDeltaDirs) {
-          FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+          if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+            continue;
+          }
           Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
               new OrcRawRecordMerger.Options().isCompacting(false), null);
           for (Path deleteDeltaFile : deleteDeltaFiles) {
-            // NOTE: Calling last flush length below is more for future-proofing when we have
-            // streaming deletes. But currently we don't support streaming deletes, and this can
-            // be removed if this becomes a performance issue.
-            long length = OrcAcidUtils.getLastFlushLength(fs, deleteDeltaFile);
+            // NOTE: When streaming deletes are supported, consider using
+            // OrcAcidUtils.getLastFlushLength(fs, deleteDeltaFile)
             // NOTE: A check for existence of deleteDeltaFile is required because we may not have
             // deletes for the bucket being taken into consideration for this split processing.
-            if (length != -1 && fs.exists(deleteDeltaFile)) {
-              /**
-               * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-               */
-              Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile,
-                  OrcFile.readerOptions(conf).maxLength(length));
+            if (fs.exists(deleteDeltaFile)) {

Review comment:
       We might want to get rid of this exists() to save NN calls, and just handle FileNotFoundException in case of a missing file (exists() does the same check inside its method).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 442855)
    Time Spent: 20m  (was: 10m)

> VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete
> delta directories multiple times
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23597
>                 URL: https://issues.apache.org/jira/browse/HIVE-23597
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]
> {code:java}
> try {
>   final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
>   if (deleteDeltaDirs.length > 0) {
>     int totalDeleteEventCount = 0;
>     for (Path deleteDeltaDir : deleteDeltaDirs) {
> {code}
>
> Consider a directory layout like the following. It was created by running a
> simple sequence of "insert --> update --> select" queries.
>
> {noformat}
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
> /warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000
> {noformat}
>
> The OrcSplit contains the information for all of the delete delta directories.
> For a directory layout like this one, it would create {{~12 splits}}. Every
> split constructs a "ColumnizedDeleteEventRegistry" in
> VectorizedOrcAcidRowBatchReader, and each of them ends up reading all of these
> delete delta directories again.
> In this case, the delete deltas would be read approximately {{121 times!}}
> This causes a huge delay for simple queries like "{{select * from tab_x}}" on
> cloud storage.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
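The review comment above suggests dropping the fs.exists() probe and instead handling FileNotFoundException on open, since on HDFS every exists() call costs an extra NameNode RPC and the open itself fails with the same information. A minimal sketch of that pattern, using java.nio as a stand-in for the Hadoop FileSystem API (the class and method names below are illustrative, not from Hive):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class OpenWithoutExists {

  // Open-and-read directly; a missing file is reported by the exception
  // instead of a separate exists() round trip beforehand.
  static Optional<byte[]> readIfPresent(Path file) {
    try {
      return Optional.of(Files.readAllBytes(file));
    } catch (NoSuchFileException | FileNotFoundException e) {
      // Missing delete-delta bucket file for this split: simply skip it.
      return Optional.empty();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // A path that does not exist: one open attempt, no prior existence probe.
    System.out.println(readIfPresent(Paths.get("no-such-delete-delta-bucket")).isPresent());
    // prints: false
  }
}
```

The same shape applies to the Hadoop API: call FileSystem.open() (or OrcFile.createReader) inside a try block and treat FileNotFoundException as "no deletes for this bucket", halving the metadata calls on the common path.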
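The isQualifiedDeleteDeltaForSplit(...) check added in the diff above is the heart of the fix: a split only opens delete deltas whose write-id range can actually affect its rows. A simplified, self-contained sketch of that idea follows; the class name, the name parsing, and the exact skip rule here are illustrative assumptions, not the actual Hive implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeleteDeltaFilter {

  // Matches directory names like delete_delta_0000003_0000003 or
  // delete_delta_0000003_0000003_0000 (with a statement-id suffix).
  private static final Pattern DELETE_DELTA =
      Pattern.compile("delete_delta_(\\d+)_(\\d+)(?:_\\d+)?");

  // A delete event committed at write id W can only target rows that were
  // written at an earlier write id. So if the delete delta's whole write-id
  // range lies below the split's minimum write id, none of its events can hit
  // this split and the directory can be skipped entirely.
  // (Illustrative rule; the exact predicate in the patch may differ.)
  static boolean isQualified(long splitMinWriteId, String deleteDeltaDirName) {
    Matcher m = DELETE_DELTA.matcher(deleteDeltaDirName);
    if (!m.matches()) {
      return true; // not a recognized name: be conservative and keep it
    }
    long deltaMaxWriteId = Long.parseLong(m.group(2));
    return deltaMaxWriteId >= splitMinWriteId;
  }
}
```

With the layout above, a split for delta_0000007_0000007_0000 would skip delete_delta_0000003 through delete_delta_0000006 under this rule, which is what turns the quadratic "every split reads every delete delta" pattern into a much smaller number of directory reads.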