[
https://issues.apache.org/jira/browse/METRON-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159609#comment-16159609
]
ASF GitHub Bot commented on METRON-1153:
----------------------------------------
Github user ottobackwards commented on the issue:
https://github.com/apache/metron/pull/741
@justinleet My question about whether a refactor is necessary came from
wanting to handle the exception in the HdfsWriter, where the exceptions
were being caught at the time.
The fact that the tuples are 'unpaired' from their messages, and that we
handle them all together, seems problematic to me if you want to take
per-message/source-handler action.
I think your approach removes the immediate problem, although having both
the HDFS writer and the source handler itself take action, splitting the
'ownership', doesn't feel quite right to me.
@cestella It would be good to know why this is happening, but in truth, any
persistent network connection in a long-lived 'forever' type application
needs to guard against these kinds of errors.
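To illustrate the kind of guard I mean (a minimal sketch with hypothetical names, not the actual Metron code): a long-lived writer can invalidate its cached stream on any IOException so the next write reopens a fresh one instead of failing forever on a dead handle:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.Supplier;

// Hypothetical sketch: a wrapper that drops its cached stream on any
// IOException so the next write attempt reopens a fresh stream, rather
// than reusing a handle that was closed underneath it.
class ReopeningStream {
    private final Supplier<OutputStream> opener; // e.g. () -> fs.create(path)
    private OutputStream current;

    ReopeningStream(Supplier<OutputStream> opener) {
        this.opener = opener;
    }

    synchronized void write(byte[] data) throws IOException {
        if (current == null) {
            current = opener.get(); // (re)open lazily
        }
        try {
            current.write(data);
        } catch (IOException e) {
            // Invalidate so the next write reopens; close quietly since
            // the stream is already broken.
            try { current.close(); } catch (IOException ignored) { }
            current = null;
            throw e; // let the caller decide whether to retry or fail tuples
        }
    }
}
```

The point is not this particular wrapper, but that recovery has one clear owner: the component holding the stream.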
For reliability reasons, and for where we want to get to, we need clearer
documentation of the failure and recovery states of the writers, especially
the HDFS writer since we are batching. We also need to understand all the
ways the stream can fail, to the extent that that is possible.
I am not sure that limiting it to this one case is enough; there will
still be many possible ways to end up in a very unclear situation, where the
writer is failing continuously but the source handler is not removed from
'service'. Users, as we have seen on the list, will be left to dig through
the system to work their way to this problem.
Failure to store is a critical failure of the system, especially in
systems where there are data retention rules or SLAs on data loss. Thus, in
addition to handling this, we need an alerting strategy.
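As one possible shape for such a strategy (purely illustrative, not existing Metron code): track consecutive write failures per source and fire an alert once a threshold is crossed, so a continuously failing writer surfaces instead of only filling the worker logs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an alerting strategy: count consecutive write
// failures per source and signal exactly once when a streak crosses the
// threshold; any successful write resets the streak.
class WriteFailureMonitor {
    private final int threshold;
    private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();

    WriteFailureMonitor(int threshold) {
        this.threshold = threshold;
    }

    /** Returns true exactly when this failure crosses the threshold. */
    boolean recordFailure(String source) {
        int count = consecutiveFailures.merge(source, 1, Integer::sum);
        return count == threshold; // fire the alert once per streak
    }

    void recordSuccess(String source) {
        consecutiveFailures.remove(source); // a good write resets the streak
    }
}
```

The caller would wire the `true` return into whatever alerting channel the deployment uses.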
While this fix (questions pending) is an improvement on the symptom, it
does not address the higher-level issue or the severity of this problem.
> HDFS HdfsWriter never recovers from exceptions
> ----------------------------------------------
>
> Key: METRON-1153
> URL: https://issues.apache.org/jira/browse/METRON-1153
> Project: Metron
> Issue Type: Bug
> Reporter: Otto Fowler
>
> {code:java}
> o.a.m.w.BulkWriterComponent [ERROR] Failing 51 tuples
> java.io.IOException: Stream closed
> at org.apache.hadoop.crypto.CryptoOutputStream.checkStream(CryptoOutputStream.java:250) ~[stormjar.jar:?]
> at org.apache.hadoop.crypto.CryptoOutputStream.write(CryptoOutputStream.java:133) ~[stormjar.jar:?]
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) ~[stormjar.jar:?]
> at java.io.DataOutputStream.write(DataOutputStream.java:107) ~[?:1.8.0_131]
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97) ~[?:1.8.0_131]
> at org.apache.metron.writer.hdfs.SourceHandler.handle(SourceHandler.java:74) ~[stormjar.jar:?]
> at org.apache.metron.writer.hdfs.HdfsWriter.write(HdfsWriter.java:113) ~[stormjar.jar:?]
> at org.apache.metron.writer.BulkWriterComponent.flush(BulkWriterComponent.java:239) [stormjar.jar:?]
> at org.apache.metron.writer.BulkWriterComponent.flushTimeouts(BulkWriterComponent.java:281) [stormjar.jar:?]
> at org.apache.metron.writer.bolt.BulkMessageWriterBolt.execute(BulkMessageWriterBolt.java:211) [stormjar.jar:?]
> at org.apache.storm.daemon.executor$fn__6573$tuple_action_fn__6575.invoke(executor.clj:734) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.daemon.executor$mk_task_receiver$fn__6494.invoke(executor.clj:469) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.disruptor$clojure_handler$reify__6007.onEvent(disruptor.clj:40) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:451) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:430) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.daemon.executor$fn__6573$fn__6586$fn__6639.invoke(executor.clj:853) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:484) [storm-core-1.0.1.2.5.6.0-40.jar:1.0.1.2.5.6.0-40]
> at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
> {code}
> The SourceHandler does not verify that the output stream it works with is
> open before writing. As a long running process, it should not assume that
> the stream is always valid.
> This is hard however, because there is no great way to verify that the stream
> is OK.
> Instead, the HdfsWriter could remove the source handler when there is an
> IOException, but then the issue is that we do not couple tuples to messages,
> which means that there will need to be refactoring from the bolt on down.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)