Thanks for the thoughtful and detailed reply, Kay. I'll follow up on your
suggestions and let you know if I observe anything interesting. Cheers!

-Dave


On Sun, Sep 13, 2015 at 5:26 AM, Kay Röpke <[email protected]> wrote:

> Hi!
>
> > On 11 Sep 2015, at 18:42, David Dunstan <[email protected]> wrote:
> >
> > I'm looking to get some confirmation here ...
> >
> > We are running 1.1.2 with journaling on.
> >
> > We got ourselves into a situation where poor Elasticsearch latency
> caused our buffers and our journal to fill up.
> >
> > We addressed the ES problem (turned off throttling altogether), and
> Graylog started draining out pretty quickly.
> >
> > However, because we were looking at hours of drain time, I started
> playing with the tunables, increasing the number of threads for processing
> and output as well as the ring buffer size, to see if that would help.
> >
> > In the end, it didn't.
> >
> > I believe it's because Graylog's JournalReader is single-threaded (I
> guess to preserve ordering?); because of this the entire system will be
> constrained by the throughput the JournalReader is able to achieve. We had
> plenty of resources on the system: plenty of iops, cpu, memory headroom -
> but couldn't go any faster.
> >
> > My current thinking is that, if I'm right about this, we need to run
> more Graylog instances in parallel so we can parallelize reads from the
> journal and catch up faster.
> >
> > Am I right? Is there anything we can do to make it faster?
>
> You are correct that the JournalReader is single-threaded, just as the
> writer side of the journal is single-threaded.
> This is due to the way we currently embed the underlying journal
> implementation (it's the one from Kafka). It's not done to preserve
> ordering, since we cannot make any ordering guarantees across different
> inputs anyway; it's really just a matter of implementation simplicity.
>
> In scenarios like these, with a large backlog in the journal, what you
> should be seeing is that the process buffer is completely full, because the
> JournalReader aims to keep it that way to maximize CPU utilization (it
> reads batches of messages of up to 5 MB in one go). In all scenarios we've
> observed so far, the processing overhead completely dominates the time
> spent reading from the journal. Since the entire processing pipeline has
> been blocking since 1.0, slow outputs will apply back pressure to reads
> from the journal, essentially limiting overall throughput to the bandwidth
> of the network connection to Elasticsearch. It's hard to make any more
> specific statements without knowing your exact setup, though.
>
> In theory we could use multiple Kafka partitions, and thus multiple
> threads, to write to and read from disk, but we've found that this does
> not improve throughput unless your storage is exceptionally fast and you
> have dozens of cores for processing.
>
> I've just had another look at the code, and apart from IO speed, the only
> potential bottleneck I see is the protobuf deserialization of the raw
> message. We could look at moving that to a subsequent step as well, making
> it multithreaded, but again, this is usually dwarfed by the time needed to
> run the rest of the processing chain, i.e. extractors, stream matching and
> eventually output processing (serialization to ES JSON and the actual
> network IO).
>
> It might be that you hit the 5 MB limit I mentioned above. To check
> whether that's the case, please set the log level of
> org.graylog2.shared.journal.KafkaJournal to DEBUG (you can do so via the
> server REST API). It then prints messages like:
>
> "Requesting to read a maximum of {} messages (or 5MB) from the journal,
> offset interval [{}, {})"
> "Read {} messages, total payload size {}, from journal, offset interval
> [{}, {}], requested read at {}"
>
> The total payload size is what you want to watch, along with comparing the
> requested "maximum of {} messages" to the "Read {} messages" actually
> returned. If the number read is less than the maximum and the total
> payload size is approximately 5 MB, then you are hitting this limit.
> Currently it’s not configurable, but that’s trivial to implement.
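[A minimal sketch of the check described above. The host, port, credentials, log file path, and the exact loggers endpoint path are assumptions, not confirmed against this Graylog version; the 4.9 MB threshold is an arbitrary guess near the 5 MB ceiling.]

```shell
# Assumption: Graylog 1.x REST API on graylog.example.com:12900 with admin
# credentials; the /system/loggers endpoint path is an assumption.
# curl -u admin:password -X PUT \
#   'http://graylog.example.com:12900/system/loggers/org.graylog2.shared.journal.KafkaJournal/level/debug'

# Scan the server log (path is an assumption, adjust as needed) for journal
# read batches and flag those whose total payload size is near the 5 MB cap.
awk '/Read .* messages, total payload size/ {
    size = 0
    for (i = 1; i <= NF; i++)
        if ($i == "size") size = $(i + 1) + 0   # "+ 0" strips the trailing comma
    if (size > 4900000) print "near 5MB limit:", $0
}' /var/log/graylog-server/server.log
```

If most flagged batches sit right at the cap while message counts stay below the requested maximum, that points at the read-size limit rather than processing speed.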
>
> There are more metrics we could look at, but this is the first thing that
> comes to mind.
>
> Let me know how it goes :)
>
> cheers,
> Kay
>
> --
> You received this message because you are subscribed to the Google Groups
> "Graylog Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/graylog2/EC17FBBF-A69D-4B4E-9704-03B117655E7B%40gmail.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
