wangbo opened a new issue #4058:
URL: https://github.com/apache/incubator-doris/issues/4058


   **Why building the global dict needs a lock**
   We use an ```insert dict_table as select xxx from global dict``` SQL statement to update the dict table.
   This SQL follows a ```read, change, overwrite``` pattern.
   If multiple load jobs update the dict table concurrently, the data of the dict table becomes wrong.
   So we need to guarantee that writing to the dict is a mutually exclusive operation.
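   To make the race concrete, here is a minimal, deterministic sketch (plain Python with a hypothetical in-memory table standing in for the dict table) of two load jobs whose read and overwrite steps interleave; the second overwrite silently discards the first job's new key:

   ```python
   # Hypothetical in-memory stand-in for the global dict table; the real
   # update is the INSERT ... SELECT above, but its read/change/overwrite
   # steps race the same way.
   dict_table = {"a": 0}

   # Job A and Job B both read the same snapshot before either writes.
   snapshot_a = dict(dict_table)
   snapshot_b = dict(dict_table)

   # Each job encodes its own new value with the next free id.
   snapshot_a["b"] = len(snapshot_a)
   snapshot_b["c"] = len(snapshot_b)

   # Each job overwrites the table; B's overwrite clobbers A's result.
   dict_table = snapshot_a
   dict_table = snapshot_b

   print(dict_table)  # Job A's key "b" is lost
   ```

   With mutual exclusion, Job B would have read a snapshot that already contained ```"b"```, so no encoded key would be lost.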
   
   **Short-term Solution: LoadJobScheduler submits tasks one by one**
   1 LoadJobScheduler fetches from the queue a LoadJob that contains bitmap columns, named ```job_A```.
   2 LoadJobScheduler gets ```job_A```'s Doris table, named ```doris_tab_A```.
   3 LoadJobScheduler traverses ```loadTaskScheduler```'s runningTasks to check whether there is already a running task loading data into ```doris_tab_A```. If there is, it skips ```job_A```.
   
   **Some questions about the Short-term Solution**
   1 If users submit too many load statements, LoadJobScheduler's queue may grow too long. So we need to forbid submitting new load statements when there are too many load jobs that need to build the dict.
   2 The advantage of this solution is that it is easy to implement. The disadvantage is that the granularity of the lock is too coarse.
   
   Currently I prefer the short-term solution so that ```Spark Load``` can land quickly in the production environment.
   But we still have an ultimate solution to discuss.
   
   **Ultimate Solution: We need a distributed lock**
   **Using ZooKeeper**
   Actually this is the quickest solution, and it is graceful enough at the code level.
   But it introduces an additional component for Doris, which makes the whole architecture more complicated.
   
   **Implement a distributed lock in FE**
   1 FE keeps a lock set which contains lock names.
   2 Lock: the client calls an HTTP API to write the lock name into FE's lock set.
   3 Unlock:
   3.1 The client calls an HTTP API to remove the lock name from FE's lock set.
   3.2 The client keeps reporting a heartbeat to FE; if the heartbeat times out, FE releases the lock.
   
   **Some questions about implementing a distributed lock in FE**
   1 When the FE leader changes, how does the new leader get the lock set?
      A client that owns a lock and finds the leader has changed will try to lock again, telling the new leader that it still owns the lock.
      The new leader won't accept new lock requests until it has waited long enough for the current lock owners' re-lock requests to arrive or time out.
   
   2 How does the client know which FE to connect to?
   2.1 Use a domain name.
   2.2 The client keeps all FE hosts.
   
   3 How do we handle a client that owns the lock, times out, and then reconnects?
   First, it should not own the lock even after reconnecting.
   Also, the global dict may already be wrong, and we need to reload the data.
   Using ZooKeeper may also hit this case.
       
   

