[ https://issues.apache.org/jira/browse/IMPALA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971338#comment-16971338 ]
Vihang Karajgaonkar edited comment on IMPALA-9140 at 11/11/19 7:12 AM:
-----------------------------------------------------------------------
Thanks [~stigahuang] for the detailed explanation. The reason I see it as
redundant is as follows:
For async load requests (backgroundLoad or prioritized; it doesn't matter,
since both add the table name to the FIFO queue):
1. A thread in the {{TableLoadingMgr.startTableLoadingThreads}} pool makes sure
that the table name added to the queue is submitted to the loading
pool {{tblLoadingPool_}}.
2. The async threads call {{loadNextTable}} which internally just picks up the
element from the queue and calls {{getOrLoadTable}}. {{getOrLoadTable}} calls
{{replaceTableIfUnchanged}} which adds the table to the catalog.
In the case of sync table load calls during DDL processing, {{CatalogOpExecutor}}
directly issues a {{getOrLoadTable}} call, which internally calls
{{tableLoadingMgr_.loadAsync}}, which does a {{tblLoadingPool_.execute()}}. It
then calls {{replaceTableIfUnchanged}}, which adds the table to the catalog.
If I understand the above code flow correctly, the only difference between a
sync task and a prioritized async load task is that a sync task bypasses the
queue and is executed directly as soon as a thread in {{tblLoadingPool_}} is
available.
Without loss of generality, assume {{tblLoadingPool_}} has size 1. Now
consider the following 2 possibilities:
a. {{tblLoadingPool_}} is busy with another load.
In this case, a prioritized load request will wait until the thread in
{{tblLoadingPool_}} is available. A sync load request will also wait for the
current load to complete. In fact, in this case it's a race between the sync
request and the prioritized load request, and there is no guarantee that the
sync load request is picked up before the prioritized load request.
b. The thread in {{tblLoadingPool_}} is available.
In this case again, either a prioritized load or a sync load request is
executed, depending on which task is submitted first.
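To illustrate case (a), here is a toy model, not Impala code. It assumes
{{tblLoadingPool_}} behaves like a plain java.util.concurrent fixed thread pool
with a FIFO work queue; the class and method names ({{PoolOrderDemo}},
{{runOrder}}) are made up for this sketch. With one busy thread, the two waiting
requests run in the order they reached {{execute()}}, whatever their intended
priority:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Impala.
final class PoolOrderDemo {
  // Returns the order in which a busy size-1 pool runs two waiting requests.
  static List<String> runOrder(String first, String second) {
    // Assumption: a standard fixed pool stands in for tblLoadingPool_.
    ExecutorService pool = Executors.newFixedThreadPool(1);
    List<String> order = new CopyOnWriteArrayList<>();
    CountDownLatch busy = new CountDownLatch(1);

    // The only thread is occupied by an in-progress load.
    pool.execute(() -> {
      try {
        busy.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      order.add("in-progress");
    });

    // Both requests now wait in the executor's internal FIFO work queue.
    pool.execute(() -> order.add(first));
    pool.execute(() -> order.add(second));

    busy.countDown(); // the in-progress load finishes
    pool.shutdown();
    try {
      pool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return order;
  }
}
```

Calling {{PoolOrderDemo.runOrder("sync", "prioritized")}} runs the sync request
first, and swapping the arguments swaps the outcome, which is the timing
dependence described above.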
So the behavior seems to be: if there is capacity in {{tblLoadingPool_}}, then
either a prioritized load or a sync load can be executed. If there is no
prioritized or sync load request, then a background load request can be
executed, if one is available. Of course, a load which is in progress is never
evicted, so it is possible that the prioritized load and sync load are both
waiting for a background load to complete.
This same behavior can be achieved by having just one pool, {{tblLoadingPool_}},
which keeps picking the table name from the head of the queue, loading it, and
adding it to the catalog. A backgroundLoad request adds the table name to the
tail of the queue. Prioritized and sync load requests add it to the head of the
queue.
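A minimal sketch of that single-queue idea, assuming a
{{java.util.concurrent.LinkedBlockingDeque}} as the queue. All names here
({{SingleQueueLoaderSketch}}, {{backgroundLoad}}, {{prioritizeLoad}},
{{runWorker}}, {{pollNext}}) are hypothetical and not the existing
TableLoadingMgr API. It also leaves out de-duplication of table names and the
way a sync caller would wait for its result:

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.function.Consumer;

// Hypothetical sketch; not the actual Impala implementation.
final class SingleQueueLoaderSketch {
  private final BlockingDeque<String> tableQueue = new LinkedBlockingDeque<>();

  // Background loads go to the tail of the queue.
  void backgroundLoad(String tableName) {
    tableQueue.addLast(tableName);
  }

  // Prioritized and sync loads go to the head of the queue.
  void prioritizeLoad(String tableName) {
    tableQueue.addFirst(tableName);
  }

  // Loop for each loading thread: take from the head, then load.
  // loadTable is a placeholder for "load the table and add it to the catalog".
  void runWorker(Consumer<String> loadTable) {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        loadTable.accept(tableQueue.takeFirst());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Non-blocking variant for inspection; returns null when the queue is empty.
  String pollNext() {
    return tableQueue.pollFirst();
  }
}
```

With two background requests queued and one prioritized request added
afterwards, {{pollNext()}} returns the prioritized table first and then the
background tables in their original order. No separate submitter pool is
involved.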
I may be missing something though. Please let me know if you think there is
something wrong in the above understanding.
> Get rid of the unnecessary load submitter thread pool in tblLoadingMgr
> ----------------------------------------------------------------------
>
> Key: IMPALA-9140
> URL: https://issues.apache.org/jira/browse/IMPALA-9140
> Project: IMPALA
> Issue Type: Bug
> Reporter: Vihang Karajgaonkar
> Priority: Major
>
> This JIRA is created as a followup on the discussion on
> https://gerrit.cloudera.org/#/c/14611 related to various pools used for
> loading tables.
> It looks like there are 2 pools of threads, both of size
> {{num_metadata_loading_threads}}. One pool is used to submit the load
> requests to another pool, {{tblLoadingPool_}}, which does the actual loading
> of the tables. I think we can get rid of the pool which submits the tasks,
> since submission is not a very time-consuming operation and can be done
> synchronously (all it needs to do is add the task to the front or back of the
> queue, depending on whether it's a prioritized load or a background load).
> This will simplify the loading code and reduce the number of unnecessary
> threads created by {{TblLoadingMgr}}.