[ 
https://issues.apache.org/jira/browse/KYLIN-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498565#comment-17498565
 ] 

ASF GitHub Bot commented on KYLIN-5163:
---------------------------------------

sleep1661 commented on pull request #1822:
URL: https://github.com/apache/kylin/pull/1822#issuecomment-1053510900


   @hit-lacus @zhangayqian  could you help take a look when you have time? 
Thanks.




> Global dictionary build job may produce incomplete dictionary file
> ------------------------------------------------------------------
>
>                 Key: KYLIN-5163
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5163
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v4.0.1
>            Reporter: hujiahua
>            Priority: Major
>
> The current dictionary Spark build job uses the function 
> `NBucketDictionary.saveBucketDict` to write the dictionary files (both the 
> CURR file and the PREV file) for each partition. However, it does not account 
> for multiple tasks running concurrently for the same partition, as can happen 
> with task retries or speculative execution. Concurrent tasks for one 
> partition can leave an incomplete dictionary file, and we have encountered 
> this issue in production.
> The issue unfolds along the following timeline: 
> 1. During the dictionary building phase, an executor, E1, was preparing to 
> build the dictionary file for partition 0.
> 2. The driver sent E1 a shutdown message because of YARN resource preemption. 
> The driver then marked the task for partition 0 as failed and scheduled a 
> retry task on another executor, E2.
> 3. E2 began to process the task and finished it in a short time.
> 4. After E2 finished, E1 began to process its own task: it deleted the 
> complete dictionary file created by E2 and created a new dictionary file to 
> write to.
> 5. E1 then received the driver's shutdown message and killed itself, leaving 
> behind an incomplete dictionary file.
> 6. After the other partitions finished, the stage was marked successful.
> 7. When the next phase, table encoding, used the incomplete dictionary file, 
> the stage failed.
>  
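
The timeline above can be replayed in miniature. The following is only a sketch under stated assumptions, not Kylin code and not necessarily the approach taken in the pull request: local files stand in for the dictionary storage, the class `DictRaceSketch`, the file name `CURR_0`, the attempt ids, and the entry count are all hypothetical, and a task killed mid-write is modelled by writing only the first few entries. It contrasts the delete-then-write-in-place pattern from steps 3 to 5 with one common mitigation, writing to a per-attempt temporary file and renaming it only once it is complete.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Kylin code. Local files stand in for the dictionary storage.
class DictRaceSketch {
    static final int ENTRIES = 5; // entries in a complete dictionary file (made-up size)

    // Build the first `count` dictionary lines; count < ENTRIES models a task killed mid-write.
    static List<String> entries(int count) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < Math.min(count, ENTRIES); i++) {
            lines.add("value" + i + "," + i);
        }
        return lines;
    }

    // Pattern from the timeline: delete whatever exists, then write in place.
    static void writeInPlace(Path target, int written) throws IOException {
        Files.deleteIfExists(target);
        Files.write(target, entries(written), StandardCharsets.UTF_8);
    }

    // Assumed mitigation: write to a per-attempt temp file, publish only when complete.
    static void writeViaTemp(Path target, String attemptId, int written) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + "." + attemptId + ".tmp");
        Files.write(tmp, entries(written), StandardCharsets.UTF_8);
        if (written >= ENTRIES) {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        }
        // A killed attempt never reaches the move, so its partial temp file stays unpublished.
    }

    // Replays steps 3 to 5; returns the number of entries left in the published file.
    static int replay(boolean viaTemp) {
        try {
            Path curr = Files.createTempDirectory("dict-sketch").resolve("CURR_0");
            if (viaTemp) {
                writeViaTemp(curr, "attempt-1", ENTRIES); // E2 (retry) completes first
                writeViaTemp(curr, "attempt-0", 2);       // E1 starts late, killed after 2 entries
            } else {
                writeInPlace(curr, ENTRIES);              // E2 (retry) completes first
                writeInPlace(curr, 2);                    // E1 deletes E2's file, killed after 2 entries
            }
            return Files.readAllLines(curr).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("in-place write: " + replay(false) + " of " + ENTRIES + " entries");
        System.out.println("temp + rename:  " + replay(true) + " of " + ENTRIES + " entries");
    }
}
```

With the in-place pattern the published file ends up with 2 of 5 entries, matching the incomplete file described in step 5. With the temp-file pattern the file published by the retry survives intact. Rename semantics on a distributed file system may differ from the local `Files.move` used here, so a real fix would need to be checked against the storage actually in use.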



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
