Hi Eagle dev team:
We use Eagle on a Cloudera CDH cluster, configured by following the official
website tutorial.
It had been running fine for a long time.
But today the Kafka cluster crashed, and the Kafka crash led to a NameNode
error.
The standby NameNode automatically transitioned to active status, which led to
a Hadoop cluster failure.
We think shipping the hdfs_audit log should be done by a separate daemon. It
should not be done by configuring the NameNode's log4j file and having the
NameNode load the Kafka jars at startup.
Because the NameNode and the "send to Kafka" code run in the same JVM, a Kafka
outage can crash the NameNode.
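Until the shipping is moved out of the NameNode, one possible interim mitigation (our assumption, not something we have verified in production) is to wrap the Kafka appender in log4j 1.x's AsyncAppender with Blocking=false, so that when Kafka is down audit events are dropped from a full buffer instead of blocking the NameNode's logging thread. Note that AsyncAppender can only be wired up via log4j.xml, not log4j.properties, and the appender class and parameter names below are illustrative and depend on the Eagle/Kafka version:

```xml
<!-- log4j.xml fragment (illustrative; appender class and param names
     depend on your Eagle/Kafka version) -->
<appender name="KAFKA_HDFS_AUDIT" class="org.apache.eagle.log4j.kafka.KafkaLog4jAppender">
  <param name="Topic" value="sandbox_hdfs_audit_log"/>
  <param name="BrokerList" value="broker1:9092"/>
</appender>

<!-- AsyncAppender with Blocking=false drops events when its buffer is full
     instead of stalling the NameNode's logging thread -->
<appender name="ASYNC_KAFKA" class="org.apache.log4j.AsyncAppender">
  <param name="Blocking" value="false"/>
  <param name="BufferSize" value="1024"/>
  <appender-ref ref="KAFKA_HDFS_AUDIT"/>
</appender>

<logger name="org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit" additivity="false">
  <level value="info"/>
  <appender-ref ref="ASYNC_KAFKA"/>
</logger>
```

This only reduces the blast radius; it does not remove the coupling, since the Kafka client jars still live inside the NameNode JVM.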
We think Eagle should provide a separate daemon to send the HDFS audit log to
Kafka; the design should be decoupled, not more tightly coupled.
My English is not good; I hope you can understand. A more detailed
description (originally in Chinese) follows:
We configured Eagle by following the official documentation: as described, we
edited the NameNode's log4j configuration and put the relevant Eagle jars on
the NameNode's classpath. After restarting the NameNode, the hdfs audit log
was successfully sent to Kafka, and it ran stably for a while.
But today the Kafka cluster went down, which caused problems on the NameNode:
DataNode connections to the NameNode timed out, and the standby NameNode began
to take over the cluster, but the original active NameNode was still marked as
active, which ultimately broke the Hadoop cluster.
After investigating, we found that when Kafka went down, the NameNode also
became abnormal, and that is what triggered the NameNode problems.
We suggest that the send-to-Kafka functionality should not be bound into the
NameNode. The two should be decoupled: design a separate process that reads
the audit log file and sends it to Kafka.
That way, a Kafka outage would have no impact on the NameNode.
Thanks.
[email protected]