[
https://issues.apache.org/jira/browse/KYLIN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yaqian Zhang reassigned KYLIN-4000:
-----------------------------------
Assignee: Yaqian Zhang
> Merge task hang at step #5: Convert Cuboid Data to HFile
> --------------------------------------------------------
>
> Key: KYLIN-4000
> URL: https://issues.apache.org/jira/browse/KYLIN-4000
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v2.6.1
> Reporter: Tuo Zhu
> Assignee: Yaqian Zhang
> Priority: Blocker
>
> Auto triggered merge tasks all hanged at step: Convert Cuboid Data to HFile.
> The last reduce step in "Kylin_HFile_Generator_mycube_Step".
> Hadoop UI shows job status:
> attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce
> From the syslog, following outputs are printed at first few hours:
> {code}
> 2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> attempt_1555603539483_1149_r_000000_0 is : 0.8629564
> 2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084|#1 for port
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> job_1555603539483_1149 (auth:SIMPLE)
> 2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> attempt_1555603539483_1149_r_000000_0 is : 0.86303306
> 2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084|#1 for port
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> job_1555603539483_1149 (auth:SIMPLE)
> 2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> attempt_1555603539483_1149_r_000000_0 is : 0.86305153
> 2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084|#1 for port
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> job_1555603539483_1149 (auth:SIMPLE)
> 2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> attempt_1555603539483_1149_r_000000_0 is : 0.8634469
> 2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084|#1 for port
> 46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> job_1555603539483_1149 (auth:SIMPLE)
> 2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> attempt_1555603539483_1149_r_000000_0 is : 0.8643631
> {code}
> After that, only ZK logs keeps printed following lines constantly:
> {code}
> 2019-05-11 00:02:23,445 WARN [Thread-1388778]
> org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but
> not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
> 2019-05-11 00:02:23,452 INFO [main]
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect 0x29758ab9 to
> boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6,
> retry interval 1000ms, keepAlive=60000ms
> 2019-05-11 00:02:23,452 INFO
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9]
> org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=bo
> ron:2181,carbon:2181,neon:2181 sessionTimeout=90000
> watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
> 2019-05-11 00:02:23,452 INFO
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
> org.apache.zookeeper.ClientCnxn: Opening socket connection to server
> boron/172.16.211.20:2181. Will not attempt to authenticate using SASL
> (unknown error)
> 2019-05-11 00:02:23,453 INFO
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
> org.apache.zookeeper.ClientCnxn: Socket connection established, initiating
> session, client: /172.16.211.8:52214, server: boron/172.16.211.20:2181
> 2019-05-11 00:02:23,453 INFO
> [ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
> org.apache.zookeeper.ClientCnxn: Session establishment complete on server
> boron/172.16.211.20:2181, sessionid = 0x16a061f0770a354, negotiated timeout =
> 60000
> 2019-05-11 00:02:23,455 INFO [main]
> org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close zookeeper
> connection 0x29758ab9 to boron:2181,carbon:2181,neon:2181
> 2019-05-11 00:02:23,455 INFO [main]
> org.apache.hadoop.hbase.io.hfile.CacheConfig: Created cacheConfig:
> CacheConfig:disabled
> 2019-05-11 00:02:23,456 INFO [main]
> org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2:
> Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635,
> wrote=43
> 2019-05-11 00:02:23,457 WARN [Thread-1388781]
> org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but
> not chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
> {code}
> It lasts for tens of hours. No error, no exceptions. In the end, it created
> too many blocks in HDFS. It only happens when merge task is triggered. Daily
> build jobs went very good.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)