[jira] [Commented] (IMPALA-14138) Detect if filesystem is not co-located in which case we shouldn't collect block location information

ASF subversion and git services (Jira) Thu, 17 Jul 2025 09:24:05 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007871#comment-18007871
 ]


ASF subversion and git services commented on IMPALA-14138:
----------------------------------------------------------

Commit 438461db9e82be1d91f4320cce5ab08bc895a305 in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=438461db9 ]

IMPALA-14138: Manually disable block location loading via Hadoop config

For storage systems that support block location information (HDFS,
Ozone) we always retrieve it with the assumption that we can use it for
scheduling, to do local reads. But Impala is often not co-located with
the storage system, even in on-prem deployments. For example, when
Impala runs in containers, we don't try to figure out which container
runs on which machine, even if the containers are co-located with the
storage system.

In such cases we should not reach out to the storage system to collect
file information because it can be very expensive for large tables and
we won't benefit from it at all. Since there is currently no easy way
to tell whether Impala is co-located with the storage system, this patch
adds configuration options to disable block location retrieval during
table loading.

It can be disabled globally via Hadoop Configuration:

'impala.preload-block-locations-for-scheduling': 'false'

We can restrict it to filesystem schemes, e.g.:

'impala.preload-block-locations-for-scheduling.scheme.hdfs': 'false'

When multiple storage systems are configured with the same scheme, we
can still control block location loading based on authority, e.g.:

'impala.preload-block-locations-for-scheduling.authority.mycluster': 'false'

The latter only disables block location loading for URIs like
'hdfs://mycluster/warehouse/tablespace/...'
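
As a rough illustration of how the three kinds of keys relate to a file
URI, here is a minimal Python sketch. It is not Impala's implementation
(the patch itself changes Impala's frontend); the helper name
candidate_keys is hypothetical, and treating the whole netloc (including
any port) as the authority is an assumption.

```python
from urllib.parse import urlsplit

# Key prefix taken from the commit message above.
PREFIX = "impala.preload-block-locations-for-scheduling"


def candidate_keys(uri):
    """Return the config keys that could apply to `uri` (hypothetical helper).

    Order: global key, then scheme key, then authority key.
    Assumption: the authority is the URI's netloc as parsed by urlsplit.
    """
    parts = urlsplit(uri)
    keys = [PREFIX]
    if parts.scheme:
        keys.append(f"{PREFIX}.scheme.{parts.scheme}")
    if parts.netloc:
        keys.append(f"{PREFIX}.authority.{parts.netloc}")
    return keys


if __name__ == "__main__":
    for key in candidate_keys("hdfs://mycluster/warehouse/tablespace/"):
        print(key)
```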

If block location loading is disabled by any of the switches, it cannot
be re-enabled by another, i.e. the most restrictive setting prevails.
E.g.:
  disable scheme 'hdfs', enable authority 'mycluster'
     ==> hdfs://mycluster/ is still disabled

  disable globally, enable scheme 'hdfs', enable authority 'mycluster'
     ==> hdfs://mycluster/ is still disabled, as everything else is.
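
The "most restrictive setting prevails" rule can be sketched as a logical
AND over the global, scheme and authority switches. The Python below is
only an illustration under stated assumptions, not the code from the
patch: the function name block_locations_enabled is hypothetical, the
configuration is modelled as a plain dict of strings, and a missing key is
assumed to mean 'true'.

```python
from urllib.parse import urlsplit

# Key prefix taken from the commit message above.
PREFIX = "impala.preload-block-locations-for-scheduling"


def block_locations_enabled(conf, uri):
    """Return True only if no applicable switch is set to 'false'.

    `conf` is a plain dict standing in for the Hadoop configuration.
    Assumption: an unset key counts as 'true' (loading stays enabled).
    """
    parts = urlsplit(uri)
    keys = [
        PREFIX,
        f"{PREFIX}.scheme.{parts.scheme}",
        f"{PREFIX}.authority.{parts.netloc}",
    ]
    # Any single 'false' disables loading; 'true' elsewhere cannot undo it.
    return all(conf.get(k, "true").strip().lower() != "false" for k in keys)


if __name__ == "__main__":
    # First example above: scheme disabled, authority enabled.
    conf = {
        f"{PREFIX}.scheme.hdfs": "false",
        f"{PREFIX}.authority.mycluster": "true",
    }
    print(block_locations_enabled(conf, "hdfs://mycluster/"))
```

With the first example's settings the call returns False, matching
"hdfs://mycluster/ is still disabled".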

Testing:
 * added unit tests for FileSystemUtil
 * added unit tests for the file metadata loaders
 * custom cluster tests with custom Hadoop configuration

Change-Id: I1c7a6a91f657c99792db885991b7677d2c240867
Reviewed-on: http://gerrit.cloudera.org:8080/23175
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Detect if filesystem is not co-located in which case we shouldn't collect 
> block location information
> ----------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-14138
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14138
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> For storage systems that support block location information (HDFS, Ozone) we 
> always retrieve it with the assumption that we can use it for scheduling, to 
> do local reads:
> [https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java#L154]
> But it's also typical that Impala is not co-located with the storage system, 
> not even in on-prem deployments. E.g. PVC DS Impala runs in containers, and 
> even if they are co-located, I don't think we try to figure out which 
> container runs on which machine. Also short-circuit reads are off the 
> table. In such cases we should not reach out to the storage system to collect 
> file information because it can be very expensive for large tables and we 
> won't benefit from it at all.
>  
> We could construct the file descriptors based on the information we have in 
> the Iceberg manifests, and this is what we already do on cloud storage 
> systems.
> Is there a good indicator that Impala is not co-located with the configured 
> filesystems? E.g. if data cache is enabled? But of course there could be 
> multiple filesystems configured, some of them co-located, some not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
