[heka] Slow throughput using Kafka as Input

Mathieu Chouquet-Stringer Tue, 05 May 2015 06:14:26 -0700

        Hello,

I'm facing an issue somewhat similar to the one Antonin described in
https://mail.mozilla.org/pipermail/heka/2015-April/000522.html, where
Heka is slow at getting messages from Kafka (around 1,600 messages/s).


I have set up a Kafka instance into which I dump a truckload of log
messages.

Without tuning anything, Heka is able to read from files and to output
roughly 100,000 lines/s of logs to Kafka using the Kafka output plugin.

I have 3 distinct servers (all VMs):
- 1 heka used to feed Kafka (producer): this one does at least 100,000
  messages/s with a basic config [1], running heka 0.9.2
- 1 kafka (standard config), running kafka 0.8.2.1
- 1 heka used to parse data coming from kafka (consumer), running heka
  0.9.2

This looks like a convoluted setup, but it is required by the way I
receive the logs (meaning I can't really merge the 2 heka instances,
especially if I want to add more consumers).

If I set up my heka consumer (the last server in the list above) as a
simple heka input without output or decoder or anything (cf. config in
[2]), I only get around 1,600 messages/s from Kafka.

I know Kafka can do better because locally (on the kafka server) I can
easily get around 354,000 messages/s (using kafka-console-consumer.sh).

And remotely (i.e. on the consumer machine), using
kafka-console-consumer.sh, I get around 340,000 messages/s.

I've played with maxprocs, plugin_chansize, max_process_inject,
poolsize, default_fetch_size, event_buffer_size, max_open_reqests,
background_refresh_frequency, max_wait_time and I haven't been able to
do much better.
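
For reference, the options listed above live in two different places in
a Heka config: maxprocs, plugin_chansize, max_process_inject and
poolsize are global [hekad] settings, while the others belong to the
KafkaInput section. The sketch below only shows where each one goes.
Every value in it is an arbitrary placeholder chosen for illustration,
not a tested recommendation, and the option names are spelled exactly
as in the list above.

```toml
[hekad]
maxprocs = 4
plugin_chansize = 300          # placeholder value
max_process_inject = 40        # placeholder value
poolsize = 2000                # placeholder value

[KafkaInput1]
type = "KafkaInput"
topic = "logs"
addrs = ["10.100.100.37:9092"]
splitter = "KafkaSplitter"
decoder = "ProtobufDecoder"
# All values below are illustrative placeholders, not known-good settings.
default_fetch_size = 1048576           # assumed to be in bytes
event_buffer_size = 256
max_open_reqests = 8                   # name spelled as in the list above
background_refresh_frequency = 600000  # assumed to be in milliseconds
max_wait_time = 100                    # assumed to be in milliseconds
```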

The only way I've managed to get more messages/s is to duplicate my
input (and to run more heka servers), i.e. to declare a KafkaInput2
block similar to KafkaInput1.
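
As an illustration, a duplicated input could look like the sketch
below. It copies KafkaInput1 from the consumer config [2]; the message
does not say what, if anything, differed between the two blocks, so the
partition line is an assumption (it presumes the "logs" topic has more
than one partition and that each input should read a different one).

```toml
# Hypothetical second input, copied from KafkaInput1 in config [2].
[KafkaInput2]
type = "KafkaInput"
topic = "logs"
addrs = ["10.100.100.37:9092"]
splitter = "KafkaSplitter"
decoder = "ProtobufDecoder"
# Assumption: read a different partition than KafkaInput1, which is
# left at its default partition.
partition = 1
```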

Ideally I'd like to sustain at least around 80,000 messages/s, or at
least be CPU bound by my decoders. Right now the bottleneck is the
input.

Is this something anyone has experienced? Is there something wrong in my
setup?

Wait, I just discovered something: if I switch the offset_method to
oldest, I get around 5,000 messages/s instead of 1,600.
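
For clarity, that change amounts to one extra line in the KafkaInput
section of the consumer config [2], roughly as sketched here. The
capitalized spelling "Oldest" is how the value is assumed to be written
in the KafkaInput options; the rest of the block is unchanged from [2].

```toml
[KafkaInput1]
type = "KafkaInput"
topic = "logs"
addrs = ["10.100.100.37:9092"]
splitter = "KafkaSplitter"
decoder = "ProtobufDecoder"
# Start from the oldest offset still available on the broker.
# Value spelling ("Oldest") is an assumption; the text above says "oldest".
offset_method = "Oldest"
```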

Cheers,
Mathieu

============================================================
[1] Producer config

[hekad]
maxprocs = 4
base_dir = "/data/heka-0_9_2-linux-amd64"
share_dir = "/data/heka-0_9_2-linux-amd64/share/heka"
poolsize = 300

[syslog]
type = "LogstreamerInput"
log_directory = "/data/syslog/log"
oldest_duration = "1h"
file_match = 'heka_(?P<Year>\d+)-(?P<Month>\d+)-(?P<Day>\d+)-(?P<Hour>\d+)-(?P<Minute>\d+).log'
priority = ["Year", "Month", "Day", "Hour", "Minute"]

[FxaKafkaOutput]
type = "KafkaOutput"
message_matcher = "TRUE"
topic = "logs"
addrs = ["10.100.100.37:9092"]
encoder = "ProtobufEncoder"
partitioner = "RoundRobin"

[DashboardOutput]
ticker_interval = 5


============================================================
[2] Consumer config

[hekad]
maxprocs = 4
base_dir = "/home/heka/heka"
share_dir = "/home/heka/heka/share/heka"
plugin_chansize = 300
max_process_inject = 40
poolsize = 2000

[KafkaInput1]
type = "KafkaInput"
topic = "logs"
addrs = ["10.100.100.37:9092"]
splitter = "KafkaSplitter"
decoder = "ProtobufDecoder"

[KafkaSplitter]
type = "NullSplitter"
use_message_bytes = true

[DashboardOutput]
ticker_interval = 5


-- 
Mathieu Chouquet-Stringer                               [email protected]
            The sun itself sees not till heaven clears.
                     -- William Shakespeare --
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
