[
https://issues.apache.org/jira/browse/IMPALA-7627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644035#comment-16644035
]
Todd Lipcon commented on IMPALA-7627:
-------------------------------------
I also found that this was a very expensive part of metadata loading. In the
end, it's only used when performing preflight checks on INSERT queries to
ensure that the target partitions are writable, so for a lot of use cases the
work is entirely unnecessary.
In IMPALA-7321 I suggested that we might remove this stuff entirely. An
alternative would be to defer the work until planning an INSERT query, and only
check the partitions which might be targets. In the case of a dynamic
partitioned insert, we'd still have to check all partitions, which would be
expensive, but maybe it's worth it to avoid the cost at table-load time?
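As a rough illustration of the deferred check described above, here is a minimal Python sketch. It is not Impala code: `Partition`, `unwritable_targets`, and the `is_writable` callback are hypothetical names, and the callback stands in for whatever call would actually ask the NameNode about a directory. A static insert checks only the named partition; a dynamic insert (no static target) falls back to checking every partition.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for a table partition."""
    name: str      # e.g. "day=2018-10-01"
    location: str  # directory backing the partition


def unwritable_targets(
    partitions: Iterable[Partition],
    static_target: Optional[str],
    is_writable: Callable[[str], bool],
) -> List[str]:
    """Return the names of INSERT target partitions that are not writable.

    static_target is the partition named by a static insert, or None for a
    dynamic partitioned insert, in which case every partition is a
    potential target and all of them are checked.
    """
    if static_target is None:
        targets = list(partitions)
    else:
        targets = [p for p in partitions if p.name == static_target]
    # The permission lookup happens here, at planning time, and only for
    # the partitions selected above.
    return [p.name for p in targets if not is_writable(p.location)]
```

With this shape, nothing is fetched at table-load time, and a static insert into one partition costs a single lookup.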
Another alternative suggested there is to only check the top-level of the table
rather than every partition, in the case that the partitions are all in
"standard" locations. If someone has managed to chmod one of the partition
subdirectories inappropriately, we'll just fail at runtime instead of planning
time. So, in the common case where people aren't manually messing with
partitions inside their warehouse directory, we can just avoid all this work
entirely.
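The top-level-only idea above could look roughly like the following Python sketch. The function names are made up for illustration, and it assumes plain absolute paths (not full `hdfs://` URIs) and that "standard" simply means the partition directory sits underneath the table directory.

```python
import posixpath
from typing import Iterable, List


def is_standard_location(table_dir: str, partition_dir: str) -> bool:
    """True if partition_dir lies strictly beneath table_dir."""
    table = posixpath.normpath(table_dir)
    part = posixpath.normpath(partition_dir)
    return part.startswith(table + "/")


def paths_to_check(table_dir: str, partition_dirs: Iterable[str]) -> List[str]:
    """Return the directories whose permissions need to be fetched.

    If every partition is in a standard location, only the table directory
    is checked. Otherwise fall back to checking the table directory and
    every partition directory.
    """
    partition_dirs = list(partition_dirs)
    if all(is_standard_location(table_dir, d) for d in partition_dirs):
        return [table_dir]
    return [table_dir] + partition_dirs
```

For a table whose partitions all live under the table directory, this reduces the number of lookups from one per partition to one per table.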
I'm not against parallelizing as suggested in this JIRA, but it's probably a
good time to evaluate the above alternatives.
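For comparison, the parallelization proposed in the JIRA amounts to something like the Python sketch below. This is only an illustration of the technique, not the proposed patch: `fetch_permissions` and `fetch_one` are hypothetical names, `fetch_one` stands in for the per-partition NameNode call, and the default of 10 workers is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_permissions(
    locations: Iterable[str],
    fetch_one: Callable[[str], bool],
    max_workers: int = 10,
) -> Dict[str, bool]:
    """Fetch the permission of each partition directory in parallel.

    fetch_one is called once per location from a pool of worker threads.
    Executor.map returns results in input order, so they can be zipped
    back onto the locations.
    """
    locations = list(locations)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_one, locations))
    return dict(zip(locations, results))
```

This shortens the wall-clock time of the permission phase but still issues one request per partition, which is why the alternatives above may be worth evaluating first.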
> Parallel the fetching permission process
> ----------------------------------------
>
> Key: IMPALA-7627
> URL: https://issues.apache.org/jira/browse/IMPALA-7627
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Peikai Zheng
> Assignee: Peikai Zheng
> Priority: Major
>
> Catalogd loads the metadata of a table in three phases.
> First, Catalogd fetches the metadata from the Hive metastore;
> then, Catalogd fetches the permissions of each partition from the HDFS
> NameNode;
> finally, Catalogd loads the file descriptors from the HDFS NameNode.
> According to my test results (based on commit
> *11554a17c75b242767d5a50d66bc2874aa545c77*):
> ||Average Time(GetFileInfoThread=10)||phase 1||phase 2||phase 3||
> |idm.sauron_message|9.9917115|459.2106944|95.0179163|
> |default.revenue_enriched|12.3377474|111.2969046|40.827472|
> |default.upp_raw_prod|1.5143162|50.0251426|12.6805323|
> |default.hit_to_beacon_playback_prod|1.4294509|49.7670539|18.3557858|
> |default.sitetracking_enriched|13.0003804|112.8746656|42.1824032|
> |default.player_custom_event|9.2618705|493.4865302|116.4986184|
> |default.revenue_day_est|57.9116561|106.5028664|24.005822|
> Detailed Information of tables:
> ||Table||#Partitions||#Files||Size (without replica) / TB||Size (with replica) / TB||
> |idm.sauron_message|12923|69537|44.4|90.3|
> |default.revenue_enriched|1809|1832001|145.5|308.6|
> |default.upp_raw_prod|801|480000|186.3|424|
> |default.hit_to_beacon_playback_prod|777|793900|46.6|139.9|
> |default.sitetracking_enriched|1809|1842049|21.7|65|
> |default.player_custom_event|8816|2197096|47.2|141.5|
> |default.revenue_day_est|1731|109815|25.9|77.6|
> So, I suggest parallelizing the second phase, since it accounts for the
> majority of the loading time.