I have 10 logical collectors per collector node: two for each log file I 
monitor, one with an HDFS sink and one with an S3 sink (example configs are 
sketched below the log excerpt). I recently went from 8 to 10, and the 10th 
sink is failing 100% of the time. On a node I see:

2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
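
For concreteness, each log file gets a pair of logical collectors along 
these lines (the ports, paths, and credentials here are placeholders, not my 
real ones):

    # one collector pair per monitored log file -- ports/paths illustrative
    exec config coreSiteHdfs-collector 'collectorSource(35861)' 'collectorSink("hdfs://namenode/flume/coreSite/", "core-")'
    exec config coreSiteS3-collector 'collectorSource(35862)' 'collectorSink("s3n://KEY:SECRET@bucket/coreSite/", "core-")'

The StubbornAppendSink / InsistentOpenDecorator / BackoffFailover lines in 
the log are just the decorator chain the E2E agent sink wraps around the 
Thrift connection, so the failing node is configured roughly like this (host 
names and port are from the log above; the tailed path is a stand-in):

    # node side -- tail path is hypothetical
    exec config coreSiteS3-i-654dd706 'tail("/var/log/core/site.log")' 'agentE2EChain("admin.internal.sessionm.com:35862", "ip-10-194-66-32.ec2.internal:35862")'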


On the collector, I see flow 9 (the HDFS flow for the same log file) working 
just fine. The collector opens the s3n sink for the S3 flow, but no data is 
being ingested. On the node, every append fails with "Broken pipe". I 
suspect that is the problem, but I have not found a way to fix it. I 
confirmed connectivity via telnet to the RPC source port.
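
That is, from the node box, both collector endpoints seen in the log accept 
a TCP connection:

    telnet admin.internal.sessionm.com 35862
    telnet ip-10-194-66-32.ec2.internal 35862

so basic reachability does not appear to be the issue.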

I have exhausted every fix I can think of. I have unmapped, reconfigured, 
and remapped every node to rule out a stale mapping. The master shows no 
errors, and 9 of the 10 flows are working exactly as they should.
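
For each mapping, the cycle on the master was essentially this (physical 
host and logical node names as in the log; config arguments elided):

    exec unmap ip-10-212-145-75.ec2.internal coreSiteS3-i-654dd706
    exec config coreSiteS3-i-654dd706 ...
    exec map ip-10-212-145-75.ec2.internal coreSiteS3-i-654dd706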

Does anyone have an idea?

--
Thomas Vachon
Principal Operations Architect
session M
vac...@sessionm.com

