On 08/25/2015 02:16 PM, Giordano, J C. wrote:
Heka community:
I would like to share my experiences using Heka to parse Apache log
files for insertion into InfluxDB. My initial testing & configuration
started with an out-of-the-box setup consisting of:
1) Heka (v0.11) & InfluxDB (v0.9.3) both running on a single server:
Ubuntu 14.04.2 LTS, trusty
2) A single Apache access log read from the local file system,
containing ~850K log entries.
3) A Heka configuration of: LogstreamerInput -> SandboxDecoder:
apache_access.lua -> SandboxEncoder: schema_influx_line.lua ->
HttpOutput to InfluxDB
The performance of this configuration was unsuitable for production,
taking over 12 hours to process a single log file. Compared to a
LogOutput, which completed in approximately 3 minutes, it was clear I
needed to batch-write records to InfluxDB.
Yup, clearly unacceptable.
My initial attempt to batch records via Lua was a weak effort and
ultimately unsuccessful. Attempting to queue records into a Lua table
(likely the incorrect approach)
If I were to do this, I'd encode the lines right in the filter, so the filter
periodically emits a message payload where each line is an InfluxDB line. As
you can see from looking at the encoder
(https://github.com/mozilla-services/heka/blob/dev/sandbox/lua/encoders/schema_influx_line.lua#L142),
all of the hard work is done in a reusable module. You'd just call
`add_to_payload` in the process_message function and then `inject_payload` in
the timer_event function. Then you'd just use a PayloadEncoder w/ the
HttpOutput.
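A minimal sketch of such a batching filter (the names `flush_count` and
`influx_batch` are illustrative, not from Heka; this assumes each incoming
message's Payload already holds one InfluxDB line-protocol entry, with the
field mapping done upstream) might look like:

```lua
-- Hypothetical batching filter sketch: accumulate line-protocol entries
-- into the pending payload, and flush periodically as one big payload
-- that a PayloadEncoder + HttpOutput can POST to InfluxDB.

flush_count = read_config("flush_count") or 1000
local buffered = 0

function process_message()
    local line = read_message("Payload")
    if not line then return -1, "missing payload" end
    add_to_payload(line, "\n")  -- append to the pending output payload
    buffered = buffered + 1
    if buffered >= flush_count then
        inject_payload("txt", "influx_batch")  -- emit the batch
        buffered = 0
    end
    return 0
end

function timer_event(ns)
    if buffered > 0 then
        inject_payload("txt", "influx_batch")  -- flush stragglers on timer
        buffered = 0
    end
end
```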
led to out-of-memory errors from the Lua
sandbox for batch sizes exceeding ~200 messages.
For sizeable batches you'd need to bump the memory_limit setting. You also might want to
increase the instruction_limit and output_limit values. These are all covered in the
"Common Sandbox Parameters" documentation:
http://hekad.readthedocs.org/en/v0.10.0b1/config/common_sandbox_parameter.html#config-common-sandbox-parameters
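For reference, a filter section bumping those limits might look roughly like
this (parameter names per the linked docs; the plugin name, filename, and
values are illustrative, not recommendations):

```toml
[influx_batch_filter]
type = "SandboxFilter"
filename = "lua_filters/influx_batch.lua"
ticker_interval = 5
memory_limit = 8388608        # bytes available to the sandbox
instruction_limit = 1000000   # Lua instructions per function call
output_limit = 1048576        # max bytes per injected payload
```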
Moreover, my
HttpOutput then started generating timeout errors communicating with
InfluxDB.
Not sure what was causing this.
Not being well versed in Lua, and having to develop in a sandbox
environment without the aid of any meaningful logging capability, I
found this approach way too unproductive to continue developing or
debugging further.
Understood.
My second attempt uses a native InfluxDB output plugin I created, based
on the existing ElasticSearchOutput plugin, with the ability to
batch-write records via HTTP. Changing the HttpOutput in the initial
configuration above to this new plugin altered the performance
dramatically. I’m now able to process a single Apache access log in
~4 minutes. And I’ve loaded 31 days of historical Apache logs through
Heka -> InfluxDB in under 2 hours. The number of records I’ve imported
exceeds 36 million for each of three distinct time series for a total
sum exceeding 108 million records. The performance of this has far
exceeded our expectations and we are now running Heka on a production
server. There’s no appreciable CPU load for doing this and we’re able
to write directly to InfluxDB, thus eliminating the need for log
shippers to a central server as was required with Logstash.
I'm very glad you've got a solution that has exceeded your expectations. :)
Clearly, though, you had to work way too hard to do so.
If you're willing to work with me, I'd be very interested in finding out what
sort of throughput you'd get if you used a batching filter like the one I
described above along with the HttpOutput. I'd be happy to provide you with the
source code for such a filter, and to help with the configuration to make sure
it all works as desired.
I have three requests:
1) I would greatly appreciate having a native InfluxDB output plugin
included with future releases of Heka and would like to contribute my
work for your review and consideration.
InfluxDB is widely used enough that I'm open to considering a native output
plugin, if that's really the only way we can achieve what we want in terms of
ease-of-use and performance. That's a last resort to me, though. I think it's
worth experimenting a bit more to see if we can hit our goals without it. If
the batching filter I describe above works, we can add that to the core and
we'd need much less new code.
Whether a separate plugin
exists for ElasticSearch/InfluxDB or whether a generalized
BatchHttpOutput plugin emerges is worth considering.
A BatchHttpOutput that works for both ES and InfluxDB is much more attractive
to me than separate plugins dedicated to each.
The difference
between the ElasticSearchOutput plugin and my modified InfluxDB plugin
is minimal. First, the ElasticSearch plugin assumes a fixed
endpoint (/_bulk), whereas InfluxDB relies on a query string. Second,
ElasticSearch returns a JSON response, whereas InfluxDB returns an HTTP
status of 204 (No Content). Both ElasticSearch & InfluxDB support TLS &
UDP, though I’ve not tested either of these features with InfluxDB.
Differences beyond these are minor.
Ideally UDP would be handled by a different output; having both protocols adds
a lot of (IMO unhelpful) complexity to the ES output. If the batching at the
filter level works well, it will work just as well with UdpOutput as with
HttpOutput.
2) I’ve found one problem with my output plugin that appears unrelated
to my changes or InfluxDB and most likely exists for ElasticSearch as well.
While using the LogstreamerInput to read a single file & using my
InfluxOutput (cf. ElasticSearchOutput) with `use_buffering = true`,
everything works fine.
When using the LogstreamerInput to read multiple files having a file
match pattern/priority I have to turn off buffering or I receive the
following errors:
2015/08/24 14:56:43 Diagnostics: 1 packs have been idle more than 120
seconds.
2015/08/24 14:56:43 Diagnostics: (input) Plugin names and quantities
found on idle packs:
Are there any subsequent lines that tell you which plugins have the idle pack?
From a previous discussion, it would appear there’s a deadlock
occurring. Please advise on how to debug this further.
Hrm, this is confusing. The LogstreamerInput code and the router layer
buffering code have absolutely nothing to do with each other. I don't
understand how this error could be related to an interaction between those two
settings. I don't have any debugging suggestions (other than looking at the
surrounding log lines, per my question above), but if you open an issue with a
way I can reproduce the error I'd be happy to take a look.
3) While attempting to develop customizations via the Lua sandbox, the
only practical logging facility I could use was add_to_payload(). But
that was out of scope from within lua_modules/.
Hrm, surprising. You should be able to get to the standard API functions even
from within modules. Maybe you were in a module that had excluded the global
namespace?
I would like to know how best
to relax the sandbox restrictions to gain access to the Lua IO library,
so I can capture output to stdout/log files.
The sandbox initialization parameters for decoder, filter, and encoder plugins
can be found here:
https://github.com/mozilla-services/heka/blob/dev/sandbox/lua/lua_sandbox.go.in#L53
You can temporarily allow blocked entries by editing that and rebuilding. If
you remove `'print'` from the list on line 60, for instance, you'll be able to
use print in your code.
Or, in general,
what advice do you offer on how best to develop/debug code in the Lua
sandbox?
Usually I can get enough debugging context just by returning errors with error
messages, or emitting messages with debug output using inject_message or
inject_payload.
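Both patterns can be sketched in a few lines (the field name
`Fields[status]` and the payload name `debug` are illustrative, not from the
thread):

```lua
-- Two in-sandbox debugging patterns: error returns and debug payloads.

function process_message()
    local status = read_message("Fields[status]")
    if not status then
        -- 1) Returning -1 with a message surfaces the error in hekad's log.
        return -1, "missing Fields[status]"
    end
    -- 2) Emitting a payload lets you inspect values, e.g. via a
    --    LogOutput configured with a PayloadEncoder.
    inject_payload("txt", "debug", "status=", tostring(status), "\n")
    return 0
end
```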
Thanks,
Thank you, hope this was helpful. Hopefully you'll be willing to try out the
batch influx filter...
-r
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka