Hi all,

I've managed to reproduce the stuck topology problem, and it seems to be
caused by the Netty transport. I'm now running with the ZeroMQ transport
enabled and haven't been able to reproduce the issue.

The problem is basically a Trident/Kafka transactional topology getting
stuck, i.e. re-emitting the same batches over and over again. This happens
after the Storm workers restart a few times due to the Kafka spout throwing
RuntimeExceptions (the Kafka consumer in the spout times out with a
SocketTimeoutException because of temporary network problems). Sometimes the
topology gets stuck after a single worker restart, and sometimes a few
worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from
Storm to my Kafka machines (with an iptables firewall rule). Most of the
time the spouts (workers) would restart normally after I re-enabled access
to Kafka and the topology would continue to process batches, but sometimes
the topology would get stuck re-emitting batches after the crashed workers
restarted. Manually killing and re-submitting the topology always fixes
this, and processing then continues normally.
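
For reference, the blocking rule I used was along these lines (the broker
subnet and the default Kafka port 9092 below are placeholders for my actual
setup):

    # drop outgoing traffic from the Storm workers to the Kafka brokers
    iptables -A OUTPUT -p tcp -d 10.0.0.0/24 --dport 9092 -j DROP

    # later, remove the rule to restore connectivity
    iptables -D OUTPUT -p tcp -d 10.0.0.0/24 --dport 9092 -j DROP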

I haven't been able to reproduce this scenario after reverting my Storm
cluster's transport to ZeroMQ. With the Netty transport, I can almost always
reproduce the problem by forcing a worker to restart a number of times
(about 4-5 worker restarts are enough to trigger it).
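
For reference, switching between the two transports is just the
storm.messaging.transport setting in storm.yaml (the class names below are
from memory, so double-check them against the 0.9.0.1 defaults.yaml):

    # ZeroMQ transport (the 0.9.0.x default):
    storm.messaging.transport: "backtype.storm.messaging.zmq"

    # Netty transport:
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"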

Any hints on this? Has anyone had the same problem? It does seem like a
serious issue, as it affects the reliability and fault tolerance of the
Storm cluster.

In the meantime, I'll try to prepare a reproducible test case for this.

Thanks,

Danijel


On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <[email protected]>
wrote:

> To (partially) answer my own question -- I still have no idea about the
> cause of the stuck topology, but re-submitting the topology helps -- after
> re-submitting, my topology is now running normally.
>
>
> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
> [email protected]> wrote:
>
>> Also, I did have multiple cases of my IBackingMap workers dying (because
>> of RuntimeExceptions) but successfully restarting afterwards. Throwing a
>> RuntimeException from the IBackingMap implementation is my strategy for
>> handling rare SQL database deadlocks: it forces a worker restart and thus
>> a fail-and-retry of the batch.
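>>
>> The strategy looks roughly like the sketch below (the class name and the
>> two SQL helpers are illustrative placeholders, not my actual
>> implementation):
>>
>>     import java.sql.SQLException;
>>     import java.util.List;
>>     import storm.trident.state.map.IBackingMap;
>>
>>     public abstract class DeadlockAwareMap implements IBackingMap<Long> {
>>
>>         // Placeholders for the real JDBC access code.
>>         protected abstract List<Long> readFromSql(List<List<Object>> keys)
>>                 throws SQLException;
>>         protected abstract void writeToSql(List<List<Object>> keys,
>>                 List<Long> vals) throws SQLException;
>>
>>         @Override
>>         public List<Long> multiGet(List<List<Object>> keys) {
>>             try {
>>                 return readFromSql(keys);
>>             } catch (SQLException e) {
>>                 throw new RuntimeException("multiGet failed", e);
>>             }
>>         }
>>
>>         @Override
>>         public void multiPut(List<List<Object>> keys, List<Long> vals) {
>>             try {
>>                 writeToSql(keys, vals);
>>             } catch (SQLException e) {
>>                 // Escalate (deadlock) failures so the worker dies and
>>                 // Trident fails and replays the batch.
>>                 throw new RuntimeException("forcing batch retry", e);
>>             }
>>         }
>>     }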
>>
>> From the logs, one such IBackingMap worker death (and subsequent restart)
>> resulted in the Kafka spout re-emitting the pending tuple:
>>
>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>     batch, attempt 29698959:736
>>
>> This is of course the normal behavior of a transactional topology, but
>> this is the first time I've encountered a case of a batch retrying
>> indefinitely. This is especially suspicious since the topology has been
>> running fine for 20 days straight, re-emitting batches and restarting
>> IBackingMap workers quite a number of times.
>>
>> In the SQL database backing my IBackingMap I can see that the batch with
>> the exact txid value 29698959 has been committed -- but I suspect that
>> could have come from the other IBackingMap instance, since there are two
>> of them running (parallelismHint 2).
>>
>> However, I have no idea why the batch is now being retried indefinitely,
>> nor why it hasn't been successfully acked by Trident.
>>
>> Any suggestions on which area (topology component) I should focus my
>> research on?
>>
>> Thanks,
>>
>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm having problems with my transactional Trident topology. It had been
>>> running fine for about 20 days and is now suddenly stuck processing a
>>> single batch, with no tuples being emitted and nothing being persisted by
>>> the TridentState (IBackingMap).
>>>
>>> It's a simple topology that consumes messages off a Kafka queue. The
>>> spout is an instance of the storm-kafka-0.8-plus
>>> TransactionalTridentKafkaSpout, and I use the trident-mssql transactional
>>> TridentState implementation to persistentAggregate() data into a SQL
>>> database.
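>>>
>>> The wiring is essentially the sketch below (the topic, Zookeeper address
>>> and the aggregation are simplified placeholders, and MemoryMapState.Factory
>>> stands in here for the trident-mssql transactional StateFactory I actually
>>> use):
>>>
>>>     import backtype.storm.generated.StormTopology;
>>>     import backtype.storm.tuple.Fields;
>>>     import storm.kafka.ZkHosts;
>>>     import storm.kafka.trident.TransactionalTridentKafkaSpout;
>>>     import storm.kafka.trident.TridentKafkaConfig;
>>>     import storm.trident.TridentTopology;
>>>     import storm.trident.operation.builtin.Count;
>>>     import storm.trident.testing.MemoryMapState;
>>>
>>>     public class TopologySketch {
>>>         public static StormTopology buildTopology() {
>>>             // Kafka spout config: Zookeeper address and topic are placeholders.
>>>             TridentKafkaConfig spoutConf = new TridentKafkaConfig(
>>>                     new ZkHosts("zkhost:2181"), "mytopic");
>>>
>>>             TridentTopology topology = new TridentTopology();
>>>             topology.newStream("kafka-spout",
>>>                             new TransactionalTridentKafkaSpout(spoutConf))
>>>                     .persistentAggregate(new MemoryMapState.Factory(),
>>>                             new Count(), new Fields("count"));
>>>             return topology.build();
>>>         }
>>>     }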
>>>
>>> In Zookeeper I can see that Storm is retrying a batch, i.e.
>>>
>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>      "{"29698959":6487}"
>>>
>>> ... and the attempt count keeps increasing. It seems the batch with txid
>>> 29698959 is stuck and isn't being acked by Trident, and I have no idea
>>> why, especially since the topology had been running successfully for the
>>> previous 20 days.
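>>>
>>> For reference, the node can be inspected with the stock Zookeeper CLI,
>>> e.g.:
>>>
>>>      zkCli.sh -server <zkhost>:2181
>>>      get /transactional/<myTopologyName>/coordinator/currattempts
>>>
>>> (where <zkhost> is one of the cluster's Zookeeper servers)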
>>>
>>> I did rebalance the topology on one occasion, after which it continued
>>> running normally. Other than that, no modifications were made. Storm is
>>> at version 0.9.0.1.
>>>
>>> Any hints on how to debug the stuck topology? Any other useful info I
>>> might provide?
>>>
>>> Thanks,
>>>



-- 
Danijel Schiavuzzi

E: [email protected]
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
