Thanks for the thoughtful and detailed reply Kay ... I'll follow up on your
suggestions and let you know if there's anything interesting being
observed. Cheers!

-Dave


On Sun, Sep 13, 2015 at 5:26 AM, Kay Röpke <kroe...@gmail.com> wrote:

> Hi!
>
> > On 11 Sep 2015, at 18:42, David Dunstan <daduns...@twilio.com> wrote:
> >
> > I'm looking to get some confirmation here ...
> >
> > We are running 1.1.2 with journaling on.
> >
> > We got ourselves into a situation where poor Elasticsearch latency
> caused our buffers and our journal to fill up.
> >
> > We addressed the ES problem (turned off throttling altogether), and
> Graylog started draining out pretty quickly.
> >
> > However, because we were looking at hours of drain time, I started
> playing with the tunables to increase the number of threads for processing
> and output, and increased the ring buffer size to see if that would help.
> >
> > In the end, it didn't.
> >
> > I believe it's because Graylog's JournalReader is single-threaded (I
> guess to preserve ordering?); because of this, the entire system is
> constrained by the throughput the JournalReader can achieve. We had
> plenty of headroom on the system (IOPS, CPU, and memory), but couldn't
> go any faster.
> >
> > My current thinking is that, if I'm right about this, we need to run
> more Graylog instances in parallel so we can parallelize reads from the
> journal and catch up faster.
> >
> > Am I right? Is there anything we can do to make it faster?
>
> You are correct that the JournalReader is single-threaded, just as the
> writer side of it is single-threaded.
> This is due to the way we currently embed the underlying journal
> implementation (it's the one from Kafka). It's not necessarily to preserve
> ordering, since we cannot make any ordering guarantees across different
> inputs anyway; it's really only for implementation simplicity.
>
> In scenarios like these, with a large backlog in the journal, what you
> should be seeing is that the process buffer is completely full, because the
> JournalReader aims to keep it that way to maximize CPU utilization (it
> reads as many messages as possible, up to 5 MB, in one go). In all
> scenarios we've observed so far, the processing overhead completely trumps
> the time spent reading from the journal. Since the entire processing
> pipeline has been blocking since 1.0, slow outputs apply back pressure all
> the way to the journal reads, essentially limiting overall throughput to
> the bandwidth of the network connection to Elasticsearch. It's hard to make
> any more specific statements without knowing your exact setup, though.
>
> In theory we could use multiple Kafka partitions, and thus multiple
> threads, to write to and read from disk, but we've found that, unless your
> storage is exceptionally fast and you have dozens of cores to do the
> processing, this does not improve throughput.
>
> I've just had another look at the code, and the only potential bottleneck,
> apart from IO speed, is the protobuf deserialization of the raw message.
> We could look at moving that to a subsequent, multithreaded step as well,
> but again, this is usually dwarfed by the time needed to run the rest of
> the processing chain, i.e. extractors, stream matching, and eventually
> output processing (serialization to ES JSON and the actual network IO).
>
> It might be that you hit the 5 MB limit I mentioned above. To check
> whether that's the case, please set the log level of
> org.graylog2.shared.journal.KafkaJournal to DEBUG (you can do so via the
> server REST API). It prints messages like:
>
> "Requesting to read a maximum of {} messages (or 5MB) from the journal,
> offset interval [{}, {})"
> "Read {} messages, total payload size {}, from journal, offset interval
> [{}, {}], requested read at {}”
>
> The total payload size is what you want to look at, along with comparing
> "maximum of {} messages" to "Read {} messages". If fewer messages were read
> than the requested maximum and the total payload size is approximately
> 5 MB, then you are hitting this limit.
> Currently it's not configurable, but that would be trivial to implement.
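[Editor's note: the check Kay describes can be sketched as a tiny script. The sample log line, the counts, and the 95% threshold below are all illustrative assumptions, not taken from a real Graylog log; only the message format comes from the quoted DEBUG lines.]

```python
import re

# A sample DEBUG line in the format quoted above (values are made up):
line = ('Read 1200 messages, total payload size 5242623, from journal, '
        'offset interval [1000, 2200], requested read at 1000')

m = re.search(r'Read (\d+) messages, total payload size (\d+)', line)
msgs, payload = int(m.group(1)), int(m.group(2))

# The "maximum of {} messages" value from the matching "Requesting to read"
# line (again, a made-up example value):
requested_max = 5000

# If fewer messages were read than requested and the payload is close to
# 5 MB, the reader is hitting the size cap rather than running out of data.
cap = 5 * 1024 * 1024
hit_limit = msgs < requested_max and payload >= cap * 0.95
print(hit_limit)  # True for this sample line
```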
>
> There are more metrics we could look at, but this is the first thing that
> comes to mind.
>
> Let me know how it goes :)
>
> cheers,
> Kay
>
> --
> You received this message because you are subscribed to the Google Groups
> "Graylog Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to graylog2+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/graylog2/EC17FBBF-A69D-4B4E-9704-03B117655E7B%40gmail.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
