I have 10 logical collectors per collector node, two for each log file I monitor (one for the HDFS sink, one for the S3 sink). I recently went from 8 to 10, and the 10th sink is failing 100% of the time. On a node I see:
2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487 (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )

On the collector, flow 9 (the HDFS flow for the same log file) is working just fine. I can see it open the s3n sink for the S3 flow, but no data is being ingested. On the node, every append fails with "Broken pipe". I suspect that is the problem, but I cannot find a way to fix it. I confirmed connectivity to the RPC source port via telnet. I have exhausted every measure I can think of: I unmapped, reconfigured, and remapped every node to make sure we did not have a weird mapping problem. The master shows no errors, and 9 out of the 10 flows are working just as they should. Does anyone have an idea?
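For reference, here is roughly how the failing flow is configured and mapped from the flume shell. The master host, physical node names, bucket, and AWS credentials below are placeholders, and the other nine flows are set up the same way apart from names and sink URIs:

    connect master.internal.sessionm.com:35873
    exec config coreSiteS3-collector coreSiteS3 'autoCollectorSource' 'collectorSink("s3n://AWS_KEY:AWS_SECRET@my-bucket/core-site/%Y%m%d/", "core-site-")'
    exec map ip-10-194-66-32.ec2.internal coreSiteS3-collector
    exec config coreSiteS3-i-654dd706 coreSiteS3 'tail("/var/log/core-site.log")' 'autoE2EChain'
    exec map ip-10-212-145-75.ec2.internal coreSiteS3-i-654dd706

The HDFS side of the same log file (flow 9) is configured identically except that its collectorSink points at an hdfs:// URI, and it works fine.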
--
Thomas Vachon
Principal Operations Architect
session M
vac...@sessionm.com