[ 
https://issues.apache.org/jira/browse/IMPALA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971338#comment-16971338
 ] 

Vihang Karajgaonkar edited comment on IMPALA-9140 at 11/11/19 7:12 AM:
-----------------------------------------------------------------------

Thanks [~stigahuang] for the detailed explanation. The reason I see it as 
redundant is as follows:

For async load requests (backgroundLoad or prioritized, it doesn't matter, 
since both add the table name to the FIFO queue):
1. A thread in the {{TableLoadingMgr.startTableLoadingThreads}} pool makes sure 
that each table name added to the queue is submitted to the loading pool 
{{tblLoadingPool_}}.
2. The async threads call {{loadNextTable}}, which internally just picks up the 
next element from the queue and calls {{getOrLoadTable}}. {{getOrLoadTable}} 
calls {{replaceTableIfUnchanged}}, which adds the table to the catalog.

For sync table load calls during DDL processing, CatalogOpExecutor directly 
issues a {{getOrLoadTable}}, which internally calls 
{{tableLoadingMgr_.loadAsync}}, which in turn does a 
{{tblLoadingPool_.execute()}}. It then calls {{replaceTableIfUnchanged}}, which 
adds the table to the catalog.

If I understand the above code flow correctly, the only difference between a 
sync task and a prioritized async load task is that the sync task bypasses the 
queue and is executed directly as soon as a thread in {{tblLoadingPool_}} is 
available.

Without loss of generality, assume {{tblLoadingPool_}} has size 1. Now 
consider the following 2 possibilities:
a. {{tblLoadingPool_}} is busy with another load.
In this case, a prioritized load request will wait until the thread in 
{{tblLoadingPool_}} is available. A sync load request will also wait for the 
current load to complete. In fact, in this case it's a race between the sync 
request and the prioritized load request, and there is no guarantee that the 
sync load request is picked up before the prioritized load request.
b. The thread in {{tblLoadingPool_}} is available.
In this case again, either a prioritized load or a sync load request is 
executed, depending on which task is submitted first.

So the behavior seems to be: if there is capacity in {{tblLoadingPool_}}, then 
either a prioritized load or a sync load can be executed. If there is no 
prioritized or sync load request, then a background load request can be 
executed if one is available. Of course, a load which is in progress is never 
evicted, so it is possible that the prioritized load and the sync load are both 
waiting for a background load to complete.

The same behavior can be achieved by having just one pool, {{tblLoadingPool_}}, 
which keeps picking the table name from the head of the queue, loading it, and 
adding it to the catalog. A backgroundLoad request adds the table name to the 
tail of the queue. Prioritized and sync load requests add it to the head of the 
queue.
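
To make the proposal concrete, here is a minimal sketch of the single-queue 
idea. It is not Impala code: the class {{SingleQueueLoaderSketch}}, its method 
names, and the {{loadAndAddToCatalog}} callback are all hypothetical, and the 
callback merely stands in for the {{getOrLoadTable}} / 
{{replaceTableIfUnchanged}} step described above.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.function.Consumer;

/** Hypothetical illustration only; this is not Impala's TableLoadingMgr. */
public class SingleQueueLoaderSketch {
  // One shared double-ended queue of table names waiting to be loaded.
  private final BlockingDeque<String> pending = new LinkedBlockingDeque<>();

  /** Background loads are appended to the tail of the queue. */
  public void submitBackground(String tableName) {
    pending.offerLast(tableName);
  }

  /** Prioritized and sync loads are pushed to the head of the queue. */
  public void submitPrioritized(String tableName) {
    pending.offerFirst(tableName);
  }

  /** Body of one worker thread: block for the head element, then load it. */
  public void runWorker(Consumer<String> loadAndAddToCatalog)
      throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      loadAndAddToCatalog.accept(pending.takeFirst());
    }
  }

  /** Non-blocking variant for demonstration: processes what is queued now. */
  public void drain(Consumer<String> loadAndAddToCatalog) {
    String tableName;
    while ((tableName = pending.pollFirst()) != null) {
      loadAndAddToCatalog.accept(tableName);
    }
  }
}
```

In this sketch a load that is already running is never preempted, which matches 
the behavior described above. Two things are deliberately left out: a sync 
caller would still need some way to wait for its table to finish loading, and 
because head insertion is last-in-first-out, a later prioritized request jumps 
ahead of an earlier one that is still queued.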

I may be missing something though. Please let me know if you think there is 
something wrong in the above understanding.


> Get rid of the unnecessary load submitter thread pool in tblLoadingMgr
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-9140
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9140
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Priority: Major
>
> This JIRA is created as a followup on the discussion on 
> https://gerrit.cloudera.org/#/c/14611 related to various pools used for 
> loading tables.
> It looks like there are 2 pools of threads, both of size 
> {{num_metadata_loading_threads}}. One pool is used to submit the load 
> requests to another pool, {{tblLoadingPool_}}, which does the actual loading 
> of the tables. I think we can get rid of the pool which submits the tasks, 
> since submission is not a very time-consuming operation and can be done 
> synchronously (all it needs to do is add the task to the front or the back of 
> the queue, based on whether it's a prioritized load or a background load). 
> This will simplify the loading code and reduce the number of unnecessary 
> threads created by {{TblLoadingMgr}}.


