[ 
https://issues.apache.org/jira/browse/IMPALA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971425#comment-16971425
 ] 

Quanlong Huang commented on IMPALA-9140:
----------------------------------------

Thanks for the detailed reply, [~vihangk1]. I think we are aligned on our 
understanding of the code. The key point left to discuss is how the second pool 
in {{TableLoadingMgr.startTableLoadingThreads}} balances async and sync load 
requests, i.e. how it gives them different priorities.

Let's stay with the case above, where tblLoadingPool_ has size 1. Then the 
second pool in {{TableLoadingMgr.startTableLoadingThreads}} has size 1 too.

In scenario b, yes, the first async load request and the sync load requests 
have the same chance of being executed, depending on submit time. However, the 
other async load requests in tableLoadingDeque_ don't get a chance, since 
{{loadNextTable()}} only returns after the task it submitted finishes. The 
second async request is therefore submitted later than those sync load requests 
and will be executed after them. The same holds for the remaining async load 
requests.

In short, in the above case (poolSize=1), async load requests only get a chance 
to run after all *already pending* sync load requests finish. This is how I 
think the second pool balances sync and async loads.
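
The ordering described above can be reproduced with a small toy model. This is 
only a sketch under assumptions, not the actual Impala code: {{loadingPool}} 
stands in for tblLoadingPool_ (size 1), {{asyncDeque}} for tableLoadingDeque_, 
and the submitter thread for the second pool (size 1). The class name, table 
names and latches are hypothetical and exist only to make the run deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;

/** Toy model of the two-pool scheme with pool size 1; not the Impala code. */
class TwoPoolOrderDemo {
  static List<String> run() throws Exception {
    List<String> order = Collections.synchronizedList(new ArrayList<>());
    // Stand-in for tblLoadingPool_: one worker thread with a FIFO task queue.
    ExecutorService loadingPool = Executors.newSingleThreadExecutor();
    // Stand-in for tableLoadingDeque_: pending async (background) requests.
    BlockingDeque<String> asyncDeque =
        new LinkedBlockingDeque<>(List.of("async-1", "async-2", "async-3"));

    CountDownLatch firstStarted = new CountDownLatch(1);
    CountDownLatch releaseFirst = new CountDownLatch(1);

    // Stand-in for the single submitter thread: take the head of the deque,
    // submit it, and block until that load finishes before taking the next.
    Thread submitter = new Thread(() -> {
      try {
        for (int i = 0; i < 3; i++) {
          String tbl = asyncDeque.takeFirst();
          loadingPool.submit(() -> {
            if (tbl.equals("async-1")) {
              firstStarted.countDown();
              releaseFirst.await(); // hold the worker so sync requests queue up
            }
            order.add(tbl);
            return null;
          }).get();
        }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    submitter.start();

    // Sync requests go straight to the loading pool while async-1 is running.
    firstStarted.await();
    loadingPool.submit(() -> { order.add("sync-1"); });
    loadingPool.submit(() -> { order.add("sync-2"); });
    releaseFirst.countDown();

    submitter.join();
    loadingPool.shutdown();
    return new ArrayList<>(order);
  }

  public static void main(String[] args) throws Exception {
    // Prints [async-1, sync-1, sync-2, async-2, async-3]
    System.out.println(run());
  }
}
```

In this model async-2 is only submitted once async-1 has finished, so it lands 
behind the sync requests that were already queued.
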
{quote}This same behavior can be achieved by having just one pool 
tblLoadingPool_ which keeps picking the tableName from the queue's head, 
loading it and adding it in the catalog. The backgroundLoad request adds the 
tablename to the tail of the queue. Prioritized and sync load requests add it 
to the head of the queue.
{quote}
Adding prioritized and sync load requests to the head doesn't respect the 
priority, because a later prioritized load request can still be put ahead of 
sync requests that are already waiting. I think a PriorityQueue may be a better 
fit, with the priority defined as sync load > prioritized load > background 
load. But that differs from the current behavior: async load requests may 
starve if new sync load requests keep jumping in. So I still think the current 
implementation makes some sense.
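
To make the PriorityQueue idea concrete, here is a minimal sketch. It is not a 
proposed patch: the class {{PriorityLoadQueueDemo}}, the {{Kind}} enum and the 
{{LoadRequest}} type are hypothetical names, and the sequence number is an 
assumption added so that requests of the same kind stay in FIFO order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

/** Sketch of a single priority-ordered load queue; all names are hypothetical. */
class PriorityLoadQueueDemo {
  // Declaration order doubles as priority order: SYNC first, BACKGROUND last.
  enum Kind { SYNC, PRIORITIZED, BACKGROUND }

  static final class LoadRequest {
    final String table;
    final Kind kind;
    final long seq; // arrival order, used as a FIFO tie-breaker within a kind

    LoadRequest(String table, Kind kind, long seq) {
      this.table = table;
      this.kind = kind;
      this.seq = seq;
    }
  }

  // Order by kind first, then by arrival order.
  static final Comparator<LoadRequest> BY_PRIORITY =
      Comparator.comparing((LoadRequest r) -> r.kind)
          .thenComparingLong(r -> r.seq);

  static List<String> drainOrder() {
    PriorityBlockingQueue<LoadRequest> queue =
        new PriorityBlockingQueue<>(16, BY_PRIORITY);
    long seq = 0;
    queue.add(new LoadRequest("bg-1", Kind.BACKGROUND, seq++));
    queue.add(new LoadRequest("bg-2", Kind.BACKGROUND, seq++));
    queue.add(new LoadRequest("sync-1", Kind.SYNC, seq++));
    queue.add(new LoadRequest("prio-1", Kind.PRIORITIZED, seq++));
    queue.add(new LoadRequest("sync-2", Kind.SYNC, seq++));

    // A single loading pool would keep polling the head like this.
    List<String> order = new ArrayList<>();
    LoadRequest next;
    while ((next = queue.poll()) != null) {
      order.add(next.table);
    }
    return order;
  }

  public static void main(String[] args) {
    // Prints [sync-1, sync-2, prio-1, bg-1, bg-2]
    System.out.println(drainOrder());
  }
}
```

The later prioritized request no longer overtakes the waiting sync requests, 
but bg-1 and bg-2 are served only once nothing of higher priority is queued, 
which is the starvation risk mentioned above.
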

This is my limited understanding. Please correct me if anything is wrong.

> Get rid of the unnecessary load submitter thread pool in tblLoadingMgr
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-9140
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9140
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Priority: Major
>
> This JIRA is created as a followup on the discussion on 
> https://gerrit.cloudera.org/#/c/14611 related to various pools used for 
> loading tables.
> It looks like there are 2 pools of threads both of the size 
> {{num_metadata_loading_threads}}. One pool is used to submit the load 
> requests to another pool {{tblLoadingPool_}} which does the actual loading of 
> the tables. I think we can get rid of the pool which submits the tasks since 
> submitting is not a very time-consuming operation and can be done 
> synchronously (all it needs to do is submit the task to the front or back of 
> the queue based on whether it's a prioritized load or a background load). 
> This will simplify the loading code and reduce the unnecessary number of 
> threads being created by {{TblLoadingMgr}}.



