Hi!
> On 11 Sep 2015, at 18:42, David Dunstan <[email protected]> wrote:
>
> I'm looking to get some confirmation here ...
>
> We are running 1.1.2 with journaling on.
>
> We got ourselves into a situation where poor Elasticsearch latency caused our
> buffers and our journal to fill up.
>
> We addressed the ES problem (turned off throttling altogether), and Graylog
> started draining out pretty quickly.
>
> However, because we were looking at hours of drain time, I started playing
> with the tunables to increase the # of threads for processing and output,
> and increased the ringbuffer size to see if that would help.
>
> In the end, it didn't.
>
> I believe it's because Graylog's JournalReader is single-threaded (I guess to
> preserve ordering?); because of this the entire system will be constrained by
> the throughput the JournalReader is able to achieve. We had plenty of
> resources on the system: plenty of iops, cpu, memory headroom - but couldn't
> go any faster.
>
> Currently thinking if I am right about this, that we need to run more Graylog
> instances in parallel so we can parallelize reads from journal and catch up
> faster.
>
> Am I right? Is there anything to do to make it faster….
You are correct that the JournalReader is single-threaded, just as the writer
side of it is. This is due to the way we currently embed the underlying journal
implementation (it’s the one from Kafka). It’s not necessarily to preserve
ordering, since we cannot make any ordering guarantees across different inputs
anyway; it’s really only a matter of implementation simplicity.
In scenarios like these, with a large backlog in the journal, what you should
be seeing is that the process buffer is completely full: the JournalReader aims
to keep it that way to maximize CPU utilization (it reads as many messages as
it can, up to 5 MB, in one go). In all scenarios we’ve observed so far, the
processing overhead completely dwarfs the time spent reading from the journal.
Since the entire processing pipeline has been blocking since 1.0, slow outputs
apply back pressure all the way to the journal reads, essentially limiting the
entire throughput to the bandwidth of the network connection to Elasticsearch.
It’s hard to make any more specific statements without knowing your exact
setup, though.
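Purely as an illustration (this is not Graylog’s actual code), the back-pressure effect can be sketched with a bounded queue: a single reader thread feeding a slow consumer can never run faster than that consumer, no matter how fast the journal itself can be read. All names here are made up for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative sketch, not Graylog code: one reader thread feeding a bounded
 * buffer. As soon as the buffer is full, put() blocks, so the reader cannot
 * outrun the downstream processor. That is the back-pressure effect described
 * above: overall throughput is capped by the slowest stage.
 */
public class BackPressureSketch {

    // Pushes `total` messages through a buffer of size `capacity` with a
    // deliberately slow consumer; returns the number of messages processed.
    static int drain(int total, int capacity) throws InterruptedException {
        BlockingQueue<String> processBuffer = new ArrayBlockingQueue<>(capacity);
        int[] processed = {0};

        Thread reader = new Thread(() -> {
            for (int i = 0; i < total; i++) {
                try {
                    processBuffer.put("message-" + i); // blocks when full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread processor = new Thread(() -> {
            for (int i = 0; i < total; i++) {
                try {
                    processBuffer.take();
                    Thread.sleep(2); // simulated slow output, e.g. ES indexing
                    processed[0]++;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        reader.start();
        processor.start();
        reader.join();
        processor.join();
        return processed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed " + drain(50, 4) + " messages");
    }
}
```

The reader finishes its loop only as fast as the processor drains the buffer, which is why adding reader threads alone would not have helped in your case.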
In theory we could use multiple Kafka partitions, and thus multiple threads,
to write to and read from disk, but we’ve found that unless your storage is
exceptionally fast and you have dozens of cores for processing, this does not
improve throughput.
I’ve just had another look at the code, and the only potential bottleneck,
apart from IO speed, is the protobuf deserialization of the raw message. We
could look at moving that to a subsequent step as well, making it
multithreaded, but again, this is usually dwarfed by the time needed to run
the rest of the processing chain, i.e. extractors, stream matching and
eventually output processing (serialization to ES JSON and the actual network IO).
It might be that you are hitting the 5 MB limit I mentioned above. To check
whether that’s the case, please set the log level of
org.graylog2.shared.journal.KafkaJournal to DEBUG (you can do so via the
server REST API). It prints messages like:
"Requesting to read a maximum of {} messages (or 5MB) from the journal, offset
interval [{}, {})"
"Read {} messages, total payload size {}, from journal, offset interval [{},
{}], requested read at {}"
The total payload size is what you want to look at, along with comparing
"a maximum of {} messages" to "Read {} messages". If the number of messages
read is less than the requested maximum and the total payload size is
approximately 5 MB, then you are hitting this limit.
Currently it’s not configurable, but that would be trivial to implement.
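To make the check concrete, here is a small standalone helper (again, not part of Graylog; the thresholds and field names are my assumptions based on the log messages quoted above) that decides whether a journal read was cut short by the size cap rather than the message-count cap:

```java
/**
 * Illustrative helper, not Graylog code: given the numbers from the two
 * KafkaJournal DEBUG lines quoted above, decide whether a read was capped by
 * the 5 MB size limit. The 90% fudge factor is an assumption, since the last
 * message read will rarely land exactly on the 5 MB boundary.
 */
public class JournalReadCheck {
    private static final long FIVE_MB = 5L * 1024 * 1024;

    // True if fewer messages were read than requested while the payload is
    // close to 5 MB, i.e. the size cap (not the count cap) stopped the read.
    static boolean hitSizeLimit(long requestedMax, long readCount, long payloadBytes) {
        return readCount < requestedMax && payloadBytes >= (long) (FIVE_MB * 0.9);
    }

    public static void main(String[] args) {
        // Example values as they might appear in the two log lines.
        long requestedMax = 65536;    // "a maximum of {} messages"
        long readCount = 41210;       // "Read {} messages"
        long payloadBytes = 5242801;  // "total payload size {}", ~5 MB
        System.out.println(hitSizeLimit(requestedMax, readCount, payloadBytes)
                ? "read capped by 5 MB limit"
                : "read not size-limited");
    }
}
```

If the check comes out true for most reads while draining, the 5 MB cap, and not your hardware, is the ceiling you are hitting.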
There are more metrics we could look at, but this is the first thing that comes
to mind.
Let me know how it goes :)
cheers,
Kay
--
You received this message because you are subscribed to the Google Groups
"Graylog Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/graylog2/EC17FBBF-A69D-4B4E-9704-03B117655E7B%40gmail.com.