[ 
https://issues.apache.org/jira/browse/KYLIN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaqian Zhang reassigned KYLIN-4000:
-----------------------------------

    Assignee: Yaqian Zhang

> Merge task hang at step #5: Convert Cuboid Data to HFile
> --------------------------------------------------------
>
>                 Key: KYLIN-4000
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4000
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v2.6.1
>            Reporter: Tuo Zhu
>            Assignee: Yaqian Zhang
>            Priority: Blocker
>
> Auto triggered merge tasks all hanged at step: Convert Cuboid Data to HFile. 
> The last reduce step in "Kylin_HFile_Generator_mycube_Step".
> Hadoop UI shows job status:
>  attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce
> From the syslog, following outputs are printed at first few hours:
> {code}
>  2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1555603539483_1149_r_000000_0 is : 0.8629564
>  2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084|#1 for port 
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
> job_1555603539483_1149 (auth:SIMPLE)
>  2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1555603539483_1149_r_000000_0 is : 0.86303306
>  2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084|#1 for port 
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
> job_1555603539483_1149 (auth:SIMPLE)
>  2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1555603539483_1149_r_000000_0 is : 0.86305153
>  2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084|#1 for port 
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
> job_1555603539483_1149 (auth:SIMPLE)
>  2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1555603539483_1149_r_000000_0 is : 0.8634469
>  2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084|#1 for port 
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
> job_1555603539483_1149 (auth:SIMPLE)
>  2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1555603539483_1149_r_000000_0 is : 0.8643631
> {code}
>  After that, only ZK logs keeps printed following lines constantly:
> {code}
> 2019-05-11 00:02:23,445 WARN [Thread-1388778] 
> org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but 
> not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
>  2019-05-11 00:02:23,452 INFO [main] 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect 0x29758ab9 to 
> boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6, 
> retry interval 1000ms, keepAlive=60000ms
>  2019-05-11 00:02:23,452 INFO 
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9] 
> org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=bo
>  ron:2181,carbon:2181,neon:2181 sessionTimeout=90000 
> watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
>  2019-05-11 00:02:23,452 INFO 
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
>  org.apache.zookeeper.ClientCnxn: Opening socket connection to server 
> boron/172.16.211.20:2181. Will not attempt to authenticate using SASL 
> (unknown error)
>  2019-05-11 00:02:23,453 INFO 
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
>  org.apache.zookeeper.ClientCnxn: Socket connection established, initiating 
> session, client: /172.16.211.8:52214, server: boron/172.16.211.20:2181
>  2019-05-11 00:02:23,453 INFO 
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
>  org.apache.zookeeper.ClientCnxn: Session establishment complete on server 
> boron/172.16.211.20:2181, sessionid = 0x16a061f0770a354, negotiated timeout = 
> 60000
>  2019-05-11 00:02:23,455 INFO [main] 
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close zookeeper 
> connection 0x29758ab9 to boron:2181,carbon:2181,neon:2181
>  2019-05-11 00:02:23,455 INFO [main] 
> org.apache.hadoop.hbase.io.hfile.CacheConfig: Created cacheConfig: 
> CacheConfig:disabled
>  2019-05-11 00:02:23,456 INFO [main] 
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: 
> Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635,
>  wrote=43
>  2019-05-11 00:02:23,457 WARN [Thread-1388781] 
> org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but 
> not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
> {code}
> It lasts for tens of hours. No error, no exceptions. In the end, it created 
> too many blocks in HDFS. It only happens when merge task is triggered. Daily 
> build jobs went very good.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to