Thanks for the thoughtful and detailed reply, Kay ... I'll follow up on your suggestions and let you know if I observe anything interesting. Cheers!
-Dave

On Sun, Sep 13, 2015 at 5:26 AM, Kay Röpke <[email protected]> wrote:

> Hi!
>
> > On 11 Sep 2015, at 18:42, David Dunstan <[email protected]> wrote:
> >
> > I'm looking to get some confirmation here ...
> >
> > We are running 1.1.2 with journaling on.
> >
> > We got ourselves into a situation where poor Elasticsearch latency caused our buffers and our journal to fill up.
> >
> > We addressed the ES problem (turned off throttling altogether), and Graylog started draining out pretty quickly.
> >
> > However, because we were looking at hours of drain time, I started playing with the tunables to increase the # of threads for processing and output, and increased the ring buffer size to see if that would help.
> >
> > In the end, it didn't.
> >
> > I believe it's because Graylog's JournalReader is single-threaded (I guess to preserve ordering?); because of this, the entire system will be constrained by the throughput the JournalReader is able to achieve. We had plenty of resources on the system: plenty of IOPS, CPU, and memory headroom - but we couldn't go any faster.
> >
> > I'm currently thinking that, if I am right about this, we need to run more Graylog instances in parallel so we can parallelize reads from the journal and catch up faster.
> >
> > Am I right? Is there anything to do to make it faster?
>
> You are correct that the JournalReader is single-threaded, just as the writer part of it is single-threaded.
> This is due to the way we currently embed the underlying journal implementation (it's the one from Kafka). It's not necessarily to preserve ordering, since we cannot make any ordering guarantees across different inputs anyway; it's really only for implementation simplicity.
>
> In scenarios like these, with a large backlog in the journal, what you should be seeing is that the process buffer is completely full, because the JournalReader aims to keep it that way to maximize CPU utilization (it reads as many messages as will fit in 5 MB in one go). In all scenarios we've observed so far, the processing overhead completely trumps the time spent reading from the journal. And because the entire processing pipeline has been blocking since 1.0, slow outputs will apply back pressure to reading from the journal, essentially limiting the entire throughput to the bandwidth of the network connection to Elasticsearch. It's hard to make any more specific statements without knowing your exact setup, though.
>
> In theory we could use multiple Kafka partitions, and thus multiple threads, to write to and read from disk, but we've found that this does not improve throughput unless your storage is exceptionally fast and you have dozens of cores to do the processing.
>
> I've just had another look at the code, and the only potential bottleneck, apart from IO speed, is the protobuf deserialization of the raw message. We could look at moving that to a subsequent step as well, making it multithreaded, but again, this is usually dwarfed by the time necessary to run the rest of the processing chain, i.e. extractors, stream matching, and eventually output processing (serialization to ES JSON and actual network IO).
>
> It might be that you hit the 5 MB limit I mentioned above. To check whether that's the case, please set the log level of org.graylog2.shared.journal.KafkaJournal to DEBUG (you can do so via the server REST API).
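As a sketch, enabling that DEBUG level over the REST API could look like the Java snippet below. The endpoint shape (PUT /system/loggers/{loggerName}/level/{level}) is an assumption here, as are the host, port, and credentials; check your server's API browser before relying on it.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class EnableJournalDebugLogging {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint: PUT /system/loggers/{loggerName}/level/{level}
            // graylog.example.org:12900 and admin:password are placeholders.
            URL url = new URL("http://graylog.example.org:12900"
                    + "/system/loggers/org.graylog2.shared.journal.KafkaJournal/level/DEBUG");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            String credentials = Base64.getEncoder().encodeToString(
                    "admin:password".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + credentials);
            System.out.println("HTTP status: " + conn.getResponseCode()); // expect 2xx
            conn.disconnect();
        }
    }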
> It prints messages like:
>
> "Requesting to read a maximum of {} messages (or 5MB) from the journal, offset interval [{}, {})"
> "Read {} messages, total payload size {}, from journal, offset interval [{}, {}], requested read at {}"
>
> The total payload size is what you want, along with a comparison of the requested maximum to "read {} messages". If the number of messages read is less than the maximum and the total payload size is approximately 5 MB, then you are hitting this limit.
> Currently it's not configurable, but that's trivial to implement.
>
> There are more metrics we could look at, but this is the first thing that comes to mind.
>
> Let me know how it goes :)
>
> cheers,
> Kay
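To make that interpretation concrete: a batched read like the one those DEBUG lines describe stops at either a message-count cap or a byte cap, whichever comes first. The sketch below illustrates that logic only; it is not Graylog's actual KafkaJournal code, and the Deque-based journal and all names are invented for illustration.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Illustration of the dual cap visible in the DEBUG output: a batch stops
    // at either the message-count limit or the ~5 MB byte limit, whichever is
    // hit first. Fewer messages read than requested, plus a total payload size
    // near 5 MB, means the byte cap was the limiter.
    class BatchReadSketch {
        static final long MAX_BATCH_BYTES = 5L * 1024 * 1024;

        static List<byte[]> readBatch(Deque<byte[]> journal, int maxMessages) {
            List<byte[]> batch = new ArrayList<>();
            long totalPayloadSize = 0;
            while (batch.size() < maxMessages && !journal.isEmpty()) {
                byte[] next = journal.peekFirst();   // look before consuming
                if (totalPayloadSize + next.length > MAX_BATCH_BYTES) {
                    break;                           // byte cap hit before message cap
                }
                batch.add(journal.pollFirst());
                totalPayloadSize += next.length;
            }
            return batch;
        }
    }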
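And to visualize the single-threaded reader and the back pressure described earlier in the thread, here is a minimal sketch assuming a bounded process buffer and invented Journal/Processor interfaces; again, this is an illustration of the shape of the pipeline, not Graylog's implementation.

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // One journal-reader thread feeds a bounded process buffer; several
    // processing threads drain it. When output (e.g. Elasticsearch) is slow,
    // processors stall, the buffer fills, put() blocks, and back pressure
    // reaches the single reader - capping throughput at the output's speed.
    class SingleReaderBackPressure {
        interface Journal { List<byte[]> readUpTo(long maxBytes); }
        interface Processor { void process(byte[] rawMessage); }

        static void run(Journal journal, Processor processor,
                        int bufferSize, int processorThreads) {
            BlockingQueue<byte[]> processBuffer = new ArrayBlockingQueue<>(bufferSize);

            Thread reader = new Thread(() -> {
                try {
                    while (true) {
                        // read as many messages as fit in ~5 MB in one go
                        for (byte[] raw : journal.readUpTo(5L * 1024 * 1024)) {
                            processBuffer.put(raw); // blocks when full -> back pressure
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "journal-reader");
            reader.start();

            // Deserialization and the rest of the chain can run multithreaded
            // here - the "subsequent step" mentioned for the protobuf work.
            ExecutorService processors = Executors.newFixedThreadPool(processorThreads);
            for (int i = 0; i < processorThreads; i++) {
                processors.submit(() -> {
                    try {
                        while (true) {
                            processor.process(processBuffer.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
    }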
