[
https://issues.apache.org/jira/browse/METRON-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159609#comment-16159609
]
ASF GitHub Bot commented on METRON-1153:
----------------------------------------
Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/741
@justinleet My question about whether a refactor is necessary came from
wanting to handle the exception in the HdfsWriter, where the exceptions
were being caught at the time.
The fact that the tuples are 'unpaired' from their messages, and that we
handle them all together, seems problematic to me if you want to take
per-message/source-handler action.
I think your approach removes the immediate problem, although having both
the HDFS writer and the source handler itself take action, splitting the
'ownership', doesn't feel quite right to me.
@cestella It would be good to know why this is happening, but in truth, any
persistent network connection in a long-lived 'forever' type application
needs to guard against these kinds of errors.
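To illustrate the kind of guard I mean (a minimal sketch with hypothetical names, not the actual Metron code): a long-lived writer can invalidate its cached stream on any IOException so the next write reopens a fresh one instead of failing forever on a dead handle:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.Supplier;

// Hypothetical sketch: a wrapper that drops its cached stream on any
// IOException so the next write attempt reopens a fresh stream, rather
// than reusing a handle that was closed underneath it.
class ReopeningStream {
    private final Supplier<OutputStream> opener; // e.g. () -> fs.create(path)
    private OutputStream current;

    ReopeningStream(Supplier<OutputStream> opener) {
        this.opener = opener;
    }

    synchronized void write(byte[] data) throws IOException {
        if (current == null) {
            current = opener.get(); // (re)open lazily
        }
        try {
            current.write(data);
        } catch (IOException e) {
            // Invalidate so the next write reopens; close quietly since
            // the stream is already broken.
            try { current.close(); } catch (IOException ignored) { }
            current = null;
            throw e; // let the caller decide whether to retry or fail tuples
        }
    }
}
```

The point is not this particular wrapper, but that recovery has one clear owner: the component holding the stream.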
For reliability reasons, and for where we want to get to, we need clearer
documentation of the failure and recovery states of the writers, especially
the HDFS writer since we are batching. We also need to understand all the
ways the stream can fail, to the extent that that is possible.
I am not sure that limiting it to this one case is enough; there will
still be many possible ways to end up in a very unclear situation, where the
writer is failing continuously but the source handler is not removed from
'service'. Users, as we have seen on the list, will be left to dig through
the system to work their way to this problem.
Failure to store is a critical failure of the system, especially in
systems where there are data retention rules or SLAs on data loss. Thus, in
addition to handling this, we need an alerting strategy.
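As one possible shape for such a strategy (purely illustrative, not existing Metron code): track consecutive write failures per source and fire an alert once a threshold is crossed, so a continuously failing writer surfaces instead of only filling the worker logs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an alerting strategy: count consecutive write
// failures per source and signal exactly once when a streak crosses the
// threshold; any successful write resets the streak.
class WriteFailureMonitor {
    private final int threshold;
    private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();

    WriteFailureMonitor(int threshold) {
        this.threshold = threshold;
    }

    /** Returns true exactly when this failure crosses the threshold. */
    boolean recordFailure(String source) {
        int count = consecutiveFailures.merge(source, 1, Integer::sum);
        return count == threshold; // fire the alert once per streak
    }

    void recordSuccess(String source) {
        consecutiveFailures.remove(source); // a good write resets the streak
    }
}
```

The caller would wire the `true` return into whatever alerting channel the deployment uses.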
While this fix (questions pending) is an improvement on the symptom, it
does not address the higher-level issue or the severity of this problem.
> HDFS HdfsWriter never recovers from exceptions
> ----------------------------------------------
>
> Key: METRON-1153
> URL: https://issues.apache.org/jira/browse/METRON-1153
> Project: Metron
> Issue Type: Bug
> Reporter: Otto Fowler
>
> {code:java}
> o.a.m.w.BulkWriterComponent [ERROR] Failing 51 tuples
> java.io.IOException: Stream closed
> at org.apache.hadoop.crypto.CryptoOutputStream.checkStream(CryptoOutputStream.java:250) ~[stormjar.jar:?]
> at org.apache.hadoop.crypto.CryptoOutputStream.write(CryptoOutputStream.java:133) ~[stormjar.jar:?]
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) ~[stormjar.jar:?]
> at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_131]
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97) ~[?:1.8.0_131]
> at org.apache.metron.writer.hdfs.SourceHandler.handle(SourceHandler.java:74) ~[stormjar.jar:?]
> at org.apache.metron.writer.hdfs.HdfsWriter.write(HdfsWriter.java:113) ~[stormjar.jar:?]
> at org.apache.metron.writer.BulkWriterComponent.flush(BulkWriterComponent.java:239) [stormjar.jar:?]
> at org.apache.metron.writer.BulkWriterComponent.flushTimeouts(BulkWriterComponent.java:281) [stormjar.jar:?]
> at org.apache.metron.writer.bolt.BulkMessageWriterBolt.execute(BulkMessageWriterBolt.java:211) [stormjar.jar:?]
> at org.apache.storm.daemon.executor$fn__6573$tuple_action_fn__6575.invoke(executor.clj:734) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.daemon.executor$mk_task_receiver$fn__6494.invoke(executor.clj:469) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.disruptor$clojure_handler$reify__6007.onEvent(disruptor.clj:40) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:451) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:430) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.daemon.executor$fn__6573$fn__6586$fn__6639.invoke(executor.clj:853) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:484) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
> {code}
> The SourceHandler does not verify that the output stream it works with is
> open before writing. As a long running process, it should not assume that
> the stream is always valid.
> This is hard however, because there is no great way to verify that the stream
> is OK.
> Instead, the HdfsWriter could remove the source handler when there is an
> IOException, but then the issue is that we do not couple tuples to messages,
> which means that there will need to be refactoring from the bolt on down.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)