I’m trying to think of what information would be needed to look at this the
next time it happens, or if someone wanted to reproduce it:

- An example set of configurations that does not work.
- Metrics on the parser topologies involved (how many parsers, message rate).
- The platform_info.sh output for the machines running the indexing topology.
- Any load information for those machines as well.

Anyone think of anything else?



On December 7, 2017 at 07:45:26, Otto Fowler (ottobackwa...@gmail.com)
wrote:

We use TreeCache
<https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/cache/TreeCache.html>
.

When the configuration is updated in ZooKeeper, the configuration object in
the bolt is updated. This configuration is read on each message, so from
what I can see, new configurations should get picked up for the next
message.
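
For reference, a rough sketch of how a Curator TreeCache listener can keep an
in-memory configuration object current (generic Curator usage only, not the
actual Metron code; ConfigHolder and the watched path are invented for
illustration):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.recipes.cache.TreeCache;
    import org.apache.curator.framework.recipes.cache.TreeCacheEvent;

    // Hypothetical holder; the real bolt keeps its own configuration object.
    class ConfigHolder {
      private volatile byte[] rawConfig;
      void update(byte[] data) { this.rawConfig = data; }
      byte[] current() { return rawConfig; }
    }

    class ConfigWatcher {
      static TreeCache watch(CuratorFramework client, String path,
                             ConfigHolder holder) throws Exception {
        TreeCache cache = TreeCache.newBuilder(client, path).build();
        cache.getListenable().addListener((c, event) -> {
          // Any add or update under the watched path refreshes the config.
          if (event.getType() == TreeCacheEvent.Type.NODE_ADDED
              || event.getType() == TreeCacheEvent.Type.NODE_UPDATED) {
            holder.update(event.getData().getData());
          }
        });
        cache.start();
        return cache;
      }
    }

So the update is push-based from ZooKeeper, and each message simply reads
whatever configuration the listener has most recently stored.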

I could be wrong though.



On December 7, 2017 at 06:47:15, Ali Nazemian (alinazem...@gmail.com) wrote:

Thank you very much. Unfortunately, reproducing all of these situations is
very costly for us at the moment. We are avoiding the issue by using the
same batch size for all the feeds. Hopefully, with the new PR Casey provided
for the separation of ES and HDFS indexing, it will be much clearer how to
tune them.

Do you know how the indexing config gets synchronized with the topology?
Does the topology get synchronized by pulling the latest configs from ZK via
some background mechanism, or is it based on an update trigger? As I
mentioned, based on our observations it looks like the synchronization
doesn't take effect until all the old messages in the Kafka queue have been
processed with the old indexing configs.

Regards,
Ali

On Thu, Dec 7, 2017 at 12:33 AM, Otto Fowler <ottobackwa...@gmail.com>
wrote:

> Sorry,
> We flush for timeouts on every storm ‘tick’ message, not on every message.
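>
> For anyone not familiar with the pattern, here is a minimal sketch of a
> tick-aware Storm bolt (generic Storm usage, not the actual Metron writer
> bolt; the one-second tick frequency is just an example):
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.storm.Config;
>     import org.apache.storm.Constants;
>     import org.apache.storm.task.OutputCollector;
>     import org.apache.storm.task.TopologyContext;
>     import org.apache.storm.topology.OutputFieldsDeclarer;
>     import org.apache.storm.topology.base.BaseRichBolt;
>     import org.apache.storm.tuple.Tuple;
>
>     // Illustrative only: ask Storm for a tick every second and check
>     // batch timeouts only when a tick arrives.
>     public class TickAwareBolt extends BaseRichBolt {
>       private OutputCollector collector;
>
>       @Override
>       public Map<String, Object> getComponentConfiguration() {
>         Map<String, Object> conf = new HashMap<>();
>         conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 1);
>         return conf;
>       }
>
>       @Override
>       public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>         this.collector = collector;
>       }
>
>       @Override
>       public void execute(Tuple tuple) {
>         boolean isTick = Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
>             && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
>         if (isTick) {
>           // Tick message: flush any batches that have exceeded their timeout.
>         } else {
>           // Normal message: add it to the sensor's batch, flush on batch size.
>         }
>         collector.ack(tuple);
>       }
>
>       @Override
>       public void declareOutputFields(OutputFieldsDeclarer declarer) {
>         // This sketch declares no output streams.
>       }
>     }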
>
>
>
> On December 6, 2017 at 08:29:51, Otto Fowler (ottobackwa...@gmail.com)
> wrote:
>
> I have looked at it.
>
> We maintain batch lists for each sensor, which gather the messages to index.
> When a message arrives that puts a list over the batch size, those messages
> are flushed and written to the target.
> There is also a timeout component, where a batch is flushed once its timeout
> expires.
>
> While the batch-size check happens per sensor as each message is received,
> each message, regardless of sensor, will trigger a check of the batch
> timeout for all the lists.
>
> At least that is what I think I see.
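>
> Roughly, the batching behaves like this sketch (purely illustrative; the
> class and method names are invented and this is not the actual writer
> code):
>
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     // Illustrative per-sensor batching with size- and time-based flushes.
>     class SensorBatcher {
>       private final int batchSize;
>       private final long timeoutMs;
>       private final Map<String, List<String>> batches = new HashMap<>();
>       private final Map<String, Long> batchStart = new HashMap<>();
>
>       SensorBatcher(int batchSize, long timeoutMs) {
>         this.batchSize = batchSize;
>         this.timeoutMs = timeoutMs;
>       }
>
>       void onMessage(String sensor, String message) {
>         List<String> batch = batches.computeIfAbsent(sensor, s -> new ArrayList<>());
>         if (batch.isEmpty()) {
>           batchStart.put(sensor, System.currentTimeMillis());
>         }
>         batch.add(message);
>         // The size check only looks at the sensor that received the message.
>         if (batch.size() >= batchSize) {
>           flush(sensor);
>         }
>       }
>
>       // The timeout check walks every list, not just the one that
>       // received a message.
>       void checkTimeouts() {
>         long now = System.currentTimeMillis();
>         for (String sensor : new ArrayList<>(batches.keySet())) {
>           if (!batches.get(sensor).isEmpty()
>               && now - batchStart.get(sensor) >= timeoutMs) {
>             flush(sensor);
>           }
>         }
>       }
>
>       private void flush(String sensor) {
>         List<String> batch = batches.remove(sensor);
>         // Write 'batch' to the target (ES / HDFS) here and ack its tuples.
>       }
>     }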
>
> Without understanding what the failures are, it is hard to see what the
> issue is.
>
> Do we have timing issues where all the lists are timing out all the time,
> causing some kind of cascading failure, for example?
> Does the number of sensors matter?  For example, if only one sensor
> topology is running with batch setup X, is everything fine?  Do failures
> start after adding the Nth additional sensor?
>
> Hopefully someone else on the list may have an idea.
> That code does not have any logging to speak of, nor the debug/trace
> logging that would help here.
>
>
>
> On December 6, 2017 at 08:18:01, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Everything looks normal except the high number of failed tuples. Do you
> know how the indexing batch size works? Based on our observations, it seems
> a batch-size change doesn't apply to the messages that are already in the
> enrichments and indexing topics.
>
> On Thu, Dec 7, 2017 at 12:13 AM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> What do you see in the storm ui for the indexing topology?
>>
>>
>> On December 6, 2017 at 07:10:17, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> Both HDFS and Elasticsearch batch sizes. There are no errors in the logs.
>> It impacts the topology error rate and causes an almost 90% error rate on
>> indexing tuples.
>>
>> On 6 Dec. 2017 00:20, "Otto Fowler" <ottobackwa...@gmail.com> wrote:
>>
>> Where are you seeing the errors?  Screenshot?
>>
>>
>> On December 5, 2017 at 08:03:46, Otto Fowler (ottobackwa...@gmail.com)
>> wrote:
>>
>> Which of the indexing options are you changing the batch size for?
>> HDFS?  Elasticsearch?  Both?
>>
>> Can you give an example?
>>
>>
>>
>> On December 5, 2017 at 02:09:29, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> No specific error in the logs. I haven't enabled debug/trace, though.
>>
>> On Tue, Dec 5, 2017 at 11:54 AM, Otto Fowler <ottobackwa...@gmail.com>
>> wrote:
>>
>>> My first thought is what are the errors when you get a high error rate?
>>>
>>>
>>> On December 4, 2017 at 19:34:29, Ali Nazemian (alinazem...@gmail.com)
>>> wrote:
>>>
>>> Any thoughts?
>>>
>>> On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian <alinazem...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We have noticed recently that no matter what batch sizes we use for
>>> > Metron indexing feeds, as soon as we start using different batch sizes
>>> > for different feeds, the indexing topology throughput starts dropping
>>> > due to a high error rate! So I was wondering whether, based on the
>>> > current indexing topology design, we have to choose the same batch size
>>> > for all the feeds; otherwise, throughput drops. Since it is acceptable
>>> > to configure different batch sizes for different feeds, I assume this
>>> > behaviour is not expected by design.
>>> >
>>> > Moreover, I have noticed in practice that even if we change the batch
>>> > size, it does not affect the messages that are already in the
>>> > enrichments or indexing topics; it only affects the new messages coming
>>> > into the parser. Therefore, we need to let all the existing messages
>>> > pass through the indexing topology before we can change the batch size!
>>> >
>>> > It would be great if we could have more details regarding the design of
>>> > this section so we can understand whether our observations follow from
>>> > the design or from some kind of bug.
>>> >
>>> > Regards,
>>> > Ali
>>> >
>>>
>>>
>>>
>>> --
>>> A.Nazemian
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>>
>>
>
>
> --
> A.Nazemian
>
>


--
A.Nazemian
