[
https://issues.apache.org/jira/browse/IMPALA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770356#comment-17770356
]
Csaba Ringhofer commented on IMPALA-12472:
------------------------------------------
A related comment:
https://issues.apache.org/jira/browse/IMPALA-7539?focusedCommentId=16876527&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16876527
It suggests completely skipping the permission check for the warehouse directory.
Adding a flag for this would be simple and could be a huge performance
improvement:
HdfsTable checks here whether to assume read+write access on a given filesystem
or whether access checks are needed. We could also pass the path to check
whether it is a subdirectory of a path where we assume read+write access.
I would consider creating a flag that takes a list of paths instead of a bool
that only controls skipping the check on the warehouse. This would allow
finer-grained control, e.g. assuming read+write access on some external
locations as well.
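
A minimal sketch of how such a path-list check could look, to make the
subdirectory test concrete. Everything here is an assumption for illustration:
the class name AssumedAccessPaths, the comma-separated flag format and the
covers() method are hypothetical and are not existing Impala code or flags.
The sketch compares plain path strings and does not resolve ".." segments or
symlinks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: holds the directories listed in a (not yet existing)
// flag and answers whether a given path lies under one of them.
public class AssumedAccessPaths {
  private final List<String> prefixes = new ArrayList<>();

  // flagValue: comma-separated list of directory paths (assumed flag format).
  public AssumedAccessPaths(String flagValue) {
    if (flagValue == null) return;
    for (String p : flagValue.split(",")) {
      String trimmed = p.trim();
      if (!trimmed.isEmpty()) prefixes.add(withTrailingSlash(trimmed));
    }
  }

  // Appending "/" prevents "/warehouse" from matching "/warehouse2".
  private static String withTrailingSlash(String path) {
    return path.endsWith("/") ? path : path + "/";
  }

  // True if 'path' equals a configured directory or lies below it.
  public boolean covers(String path) {
    String candidate = withTrailingSlash(path);
    for (String prefix : prefixes) {
      if (candidate.startsWith(prefix)) return true;
    }
    return false;
  }
}
```

With something like this, the caller could skip the access level check and
assume read+write access whenever covers() returns true for the partition
location, and fall back to the existing check otherwise.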
> Skip permission check when refreshing in event processor
> --------------------------------------------------------
>
> Key: IMPALA-12472
> URL: https://issues.apache.org/jira/browse/IMPALA-12472
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Reporter: Csaba Ringhofer
> Priority: Major
>
> Saw callstacks where most of EventProcessor's time is spent in rechecking
> access level for partition directories
> {code}
> org.apache.impala.catalog.HdfsTable.getAvailableAccessLevel
> org.apache.impala.catalog.HdfsTable.createOrUpdatePartitionBuilder
> org.apache.impala.catalog.HdfsTable.createPartitionBuilder
> org.apache.impala.catalog.HdfsTable.reloadPartitions
> org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames
> org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExis
> org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions
> org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process
> {code}
> HdfsTable.getAvailableAccessLevel() does a getFileStatus() and, if the access
> control list bit is set in the status, a getAclStatus() call to the namenode.
> It is questionable whether we should recheck this during refreshing tables
> for directories that were already checked in the past, as it can be expensive
> and is unlikely to change. AFAIK having stale data shouldn't cause security
> issues, as if Impala has no right to access/modify the file, the name node
> will simply not allow this operation (coordinators/executors use the same
> username as catalogd for HDFS ops).
> Note that the whole access level check is skipped for most filesystems
> other than HDFS (see HdfsTable.assumeReadWriteAccess()).
> Currently catalogd checks this for each partition level event (even if they
> are batched). While checking it once during CREATE PARTITION makes sense,
> rechecking it for every INSERT and ALTER seems like overkill, especially
> since an INSERT shouldn't reduce access rights on a partitioned table.
> Besides event processor, rechecking during REFRESH and reloads after
> DML/DDLs are also questionable. If there was an actual change, INVALIDATE
> METADATA can be used to reload the table from scratch.