[ https://issues.apache.org/jira/browse/IMPALA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971425#comment-16971425 ]
Quanlong Huang commented on IMPALA-9140:
----------------------------------------
Thanks for the detailed reply, [~vihangk1]. I think we agree on how the code
works. The key point left to discuss is how the second pool in
{{TableLoadingMgr.startTableLoadingThreads}} balances async and sync load
requests, and so what priority each kind effectively gets.
Let's keep using the case above, where tblLoadingPool_ has size 1. The second
pool in {{TableLoadingMgr.startTableLoadingThreads}} then has size 1 as well.
In scenario b, yes, the first async load request and the sync load requests
have the same chance of being executed first; it depends only on submit time.
However, the other async load requests in tableLoadingDeque_ don't get that
chance, because {{loadNextTable()}} only returns after the first submitted task
finishes. The second async request is therefore submitted later than those sync
load requests and is executed after them. The same holds for every subsequent
async load request.
In short, with poolSize=1, async load requests only get a chance to run after
all *already pending* sync load requests have finished. That is how I think the
second pool balances sync and async loads.
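
To make the poolSize=1 argument concrete, here is a small standalone sketch. It is not the Impala code: the class name, the latches, and the table names are made up for illustration, and it assumes (as described above) that the submitter blocks until the load it submitted has finished before it takes the next entry from the deque.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical model of the two pools; none of these names come from Impala.
class LoadOrderSketch {
  // A "load" that only records the table name in completion order.
  static Callable<Void> loadTask(String tbl, List<String> order) {
    return () -> {
      order.add(tbl);
      return null;
    };
  }

  /** Runs the poolSize=1 scenario and returns the order in which tables were loaded. */
  static List<String> run() {
    ExecutorService loadingPool = Executors.newFixedThreadPool(1);   // stands in for tblLoadingPool_
    ExecutorService submitterPool = Executors.newFixedThreadPool(1); // stands in for the second pool
    BlockingDeque<String> asyncDeque =
        new LinkedBlockingDeque<>(List.of("async-1", "async-2"));   // stands in for tableLoadingDeque_
    List<String> order = Collections.synchronizedList(new ArrayList<>());
    CountDownLatch firstLoadStarted = new CountDownLatch(1);
    CountDownLatch syncRequestsQueued = new CountDownLatch(1);
    try {
      // Assumed submitter behavior: take one table, submit it, wait for it to finish.
      Callable<Void> submitLoop = () -> {
        for (int i = 0; i < 2; i++) {
          String tbl = asyncDeque.takeFirst();
          Callable<Void> load = () -> {
            firstLoadStarted.countDown();
            syncRequestsQueued.await(); // hold the load until the sync requests are queued
            order.add(tbl);
            return null;
          };
          loadingPool.submit(load).get();
        }
        return null;
      };
      Future<Void> submitter = submitterPool.submit(submitLoop);

      firstLoadStarted.await(); // async-1 now occupies the only loading thread
      Future<Void> s1 = loadingPool.submit(loadTask("sync-1", order));
      Future<Void> s2 = loadingPool.submit(loadTask("sync-2", order));
      syncRequestsQueued.countDown(); // let async-1 finish

      submitter.get();
      s1.get();
      s2.get();
      return new ArrayList<>(order);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      loadingPool.shutdownNow();
      submitterPool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this model async-2 is loaded only after sync-1 and sync-2, because it is not submitted until async-1 has finished, by which time both sync requests are already queued in the loading pool.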
{quote}This same behavior can be achieved by having just one pool
tblLoadingPool_ which keeps picking the tableName from the queue's head,
loading it and adding it in the catalog. The backgroundLoad request adds the
tablename to the tail of the queue. Prioritized and sync load requests add it
to the head of the queue.
{quote}
Adding prioritized and sync load requests to the head doesn't preserve the
priority, because a later prioritized load request can still be put ahead of
sync requests that are already waiting. I think a PriorityQueue may be a better
fit, with the priority defined as sync load > prioritized load > background
load. But that differs from the current behavior: async load requests may
starve if new sync load requests keep jumping in. So I still think the
current implementation makes some sense.
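
A rough sketch of the difference between the two queueing schemes, using plain java.util collections. The class, the {{Kind}} enum, the sequence-number tie-breaker, and the table names are all hypothetical and only illustrate the ordering; this is not a proposal for the actual patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Impala.
class LoadQueueOrderingSketch {
  // Declaration order doubles as priority: SYNC > PRIORITIZED > BACKGROUND.
  enum Kind { SYNC, PRIORITIZED, BACKGROUND }

  static final class Request {
    final String table;
    final Kind kind;
    final long seq; // arrival order, used to keep FIFO order within one kind

    Request(String table, Kind kind, long seq) {
      this.table = table;
      this.kind = kind;
      this.seq = seq;
    }
  }

  /** Single deque: background goes to the tail, prioritized and sync go to the head. */
  static List<String> headInsertOrder() {
    Deque<String> deque = new ArrayDeque<>();
    deque.addLast("background-1");
    deque.addFirst("prioritized-1");
    deque.addFirst("sync-1");
    deque.addFirst("prioritized-2"); // arrives last but jumps ahead of sync-1
    List<String> order = new ArrayList<>();
    while (!deque.isEmpty()) {
      order.add(deque.pollFirst());
    }
    return order;
  }

  /** Priority queue ordered by kind first, then by arrival order. */
  static List<String> priorityQueueOrder() {
    PriorityQueue<Request> queue = new PriorityQueue<>(
        Comparator.comparing((Request r) -> r.kind).thenComparingLong(r -> r.seq));
    long seq = 0;
    queue.add(new Request("background-1", Kind.BACKGROUND, seq++));
    queue.add(new Request("prioritized-1", Kind.PRIORITIZED, seq++));
    queue.add(new Request("sync-1", Kind.SYNC, seq++));
    queue.add(new Request("prioritized-2", Kind.PRIORITIZED, seq++));
    List<String> order = new ArrayList<>();
    while (!queue.isEmpty()) {
      order.add(queue.poll().table);
    }
    return order;
  }

  public static void main(String[] args) {
    System.out.println("head insert:    " + headInsertOrder());
    System.out.println("priority queue: " + priorityQueueOrder());
  }
}
```

With head insertion, prioritized-2 is served before the already waiting sync-1. With the priority queue, sync-1 comes first, but background-1 stays last for as long as higher-priority requests keep arriving, which is the starvation concern.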
This is my limited understanding. Please correct me if anything is wrong.
> Get rid of the unnecessary load submitter thread pool in tblLoadingMgr
> ----------------------------------------------------------------------
>
> Key: IMPALA-9140
> URL: https://issues.apache.org/jira/browse/IMPALA-9140
> Project: IMPALA
> Issue Type: Bug
> Reporter: Vihang Karajgaonkar
> Priority: Major
>
> This JIRA is created as a followup to the discussion on
> https://gerrit.cloudera.org/#/c/14611 related to the various pools used for
> loading tables.
> It looks like there are 2 pools of threads, both of size
> {{num_metadata_loading_threads}}. One pool is used to submit the load
> requests to another pool {{tblLoadingPool_}} which does the actual loading of
> the tables. I think we can get rid of the pool which submits the tasks since
> that is not a very time-consuming operation and can be done synchronously (all
> it needs to do is submit the task to the front or back of the queue based on
> whether it's a prioritized load or a background load). This will simplify the
> loading code and reduce the number of unnecessary threads being created by
> {{TblLoadingMgr}}.