[ 
https://issues.apache.org/jira/browse/KYLIN-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498439#comment-17498439
 ] 

hujiahua commented on KYLIN-5163:
---------------------------------

I think the root cause of this issue is that the FileOutputCommitter mechanism 
is not used in the stage that writes the dictionary files. I will create a PR for this.
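
To illustrate the idea behind a committer-style write, here is a minimal sketch in plain Java. It is not Hadoop's FileOutputCommitter API and not Kylin code; the class name, the helper names, and the `part-N` / `_temporary` file layout are all assumptions made for illustration. Each task attempt writes only to its own temporary path, and a finished attempt is published with a single rename, so a killed attempt cannot damage a file that has already been published.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AttemptCommitSketch {

    // Hypothetical helper: an attempt writes only to its own temporary file,
    // never to the final location.
    static Path writeAttempt(Path dir, int partition, int attempt, String content)
            throws IOException {
        Path tmp = dir.resolve("_temporary")
                      .resolve("part-" + partition + "-attempt-" + attempt);
        Files.createDirectories(tmp.getParent());
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        return tmp;
    }

    // Hypothetical helper: publish a fully written attempt with one atomic rename.
    static Path commit(Path tmp, Path dir, int partition) throws IOException {
        Path dest = dir.resolve("part-" + partition);
        return Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dict-sketch");

        // Retry attempt (E2 in the issue) finishes and commits.
        Path e2 = writeAttempt(dir, 0, 1, "complete dictionary");
        commit(e2, dir, 0);

        // Original attempt (E1) is killed mid-write: its partial output stays
        // in its own attempt file and is never committed.
        writeAttempt(dir, 0, 0, "compl");

        String published = new String(
                Files.readAllBytes(dir.resolve("part-0")), StandardCharsets.UTF_8);
        System.out.println(published); // prints "complete dictionary"
    }
}
```

In a real job, something still has to decide which attempt is allowed to commit; this sketch leaves that part out.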

> Global dictionary build job may produce incomplete dictionary file
> ------------------------------------------------------------------
>
>                 Key: KYLIN-5163
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5163
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v4.0.1
>            Reporter: hujiahua
>            Priority: Major
>
> The current dictionary Spark build job uses the function 
> `NBucketDictionary.saveBucketDict` to write the dictionary files (both the 
> CURR file and the PREV file) for each partition. However, it does not account 
> for multiple tasks running concurrently for the same partition, as can 
> happen with task retries or speculative execution. Concurrent tasks for one 
> partition can leave an incomplete dictionary file, and we have encountered 
> this issue in production.
> The issue, described as a timeline: 
> 1. During the dictionary building phase, an executor, E1, was preparing to 
> build the dictionary file for partition 0.
> 2. The driver sent E1 a shutdown message because of YARN resource preemption. 
> The driver then marked the task for partition 0 as failed and scheduled a 
> retry task on another executor, E2.
> 3. E2 began to process the task and finished it in a short time.
> 4. After E2 finished the task, E1 began to process its task, so E1 deleted 
> the complete dictionary file created by E2 and created a new dictionary file 
> to write to.
> 5. E1 then received the driver's shutdown message and killed itself, leaving 
> behind an incomplete dictionary file.
> 6. After the other partitions finished, the stage was marked successful.
> 7. When the next phase, table encoding, uses the incomplete dictionary file, 
> the stage fails.
>  
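
Steps 3 to 5 of the timeline can be replayed sequentially in a small, self-contained Java sketch. This is not the Kylin code: `writeInPlace` is a hypothetical stand-in for a task that deletes and rewrites the final file directly, and the file name `CURR_0` and the dictionary contents are made up for illustration. A killed task is modelled simply as one that writes only the first part of the data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class InPlaceWriteReplay {

    // Hypothetical stand-in for a task writing directly to the final path:
    // it removes whatever is there, then writes `written`, which may be
    // only part of the data if the task is killed midway.
    static void writeInPlace(Path finalFile, String written) throws IOException {
        Files.deleteIfExists(finalFile);
        Files.write(finalFile, written.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path finalFile = Files.createTempDirectory("dict-replay").resolve("CURR_0");
        String full = "k1=1\nk2=2\nk3=3\n";

        // Step 3: E2 (the retry) writes the complete file.
        writeInPlace(finalFile, full);

        // Steps 4-5: E1 deletes E2's file, starts rewriting, and is killed
        // after writing only the first few bytes.
        writeInPlace(finalFile, full.substring(0, 6));

        String onDisk = new String(Files.readAllBytes(finalFile), StandardCharsets.UTF_8);
        System.out.println(onDisk.equals(full)); // prints false
    }
}
```

The stage can still be marked successful in this scenario, because E2's attempt did succeed; only the file on disk no longer matches what E2 wrote.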



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
