Heka community:

I would like to share my experience using Heka to parse Apache log files for insertion into InfluxDB. My initial testing & configuration started with an out-of-the-box setup consisting of:
1) Heka (v0.11) & InfluxDB (v0.9.3), both running on a single server: Ubuntu 14.04.2 LTS (trusty)
2) A single Apache access log, read from the local file system, with ~850K log entries
3) A Heka configuration of: LogstreamerInput -> SandboxDecoder: apache_access.lua -> SandboxEncoder: schema_influx_line.lua -> HttpOutput to InfluxDB

The performance of this configuration was unsuitable for production: it took over 12 hours to process a single log file. Since the same run using a LogOutput completed in approximately 3 minutes, it was clear I needed to batch write records to InfluxDB.

My initial attempt to batch records via Lua was a weak effort and ultimately unsuccessful. Queuing records in a Lua table (likely the incorrect approach) led to out-of-memory errors from the Lua sandbox for batch sizes exceeding ~200 messages. Moreover, my HttpOutput then started generating timeout errors when communicating with InfluxDB. Not being well versed in Lua, and having to develop in a sandbox environment without any meaningful logging capabilities, I found this approach too unproductive to continue developing or debugging.

My second attempt uses a native InfluxDB output plugin I created, based on the existing ElasticSearchOutput plugin, which can batch write records via HTTP. Replacing HttpOutput in the configuration above with this new plugin changed the performance dramatically. I'm now able to process a single Apache access log in ~4 minutes, and I've loaded 31 days of historical Apache logs through Heka -> InfluxDB in under 2 hours. I've imported more than 36 million records for each of three distinct time series, for a total exceeding 108 million records. The performance has far exceeded our expectations and we are now running Heka on a production server.

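To illustrate the batching idea described above, here is a minimal sketch in Python. It is not the Go output plugin itself. It assumes the InfluxDB 0.9 HTTP write endpoint (`/write` with `db` and `precision` in the query string) and treats HTTP 204 as success. The function names, the batch size of 1000, and the measurement names in the usage note are hypothetical.

```python
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator, List


def write_url(base: str, db: str, precision: str = "s") -> str:
    """Build the write endpoint; InfluxDB selects the database via the query string."""
    query = urllib.parse.urlencode({"db": db, "precision": precision})
    return f"{base.rstrip('/')}/write?{query}"


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` line-protocol records."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


def http_post(url: str, body: bytes) -> int:
    """POST one batch and return the HTTP status code."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def send_all(
    lines: Iterable[str],
    url: str,
    size: int = 1000,  # hypothetical batch size, tune for your setup
    post: Callable[[str, bytes], int] = http_post,
) -> int:
    """Send records in newline-joined batches; return the number of requests made."""
    requests_made = 0
    for batch in batches(lines, size):
        status = post(url, "\n".join(batch).encode("utf-8"))
        if status != 204:  # a successful write returns 204 No Content, with no body
            raise RuntimeError(f"write failed with HTTP {status}")
        requests_made += 1
    return requests_made
```

For example, `send_all(records, write_url("http://localhost:8086", "apache"))` would issue one request per 1000 records instead of one request per record. Passing a different `post` callable makes the batching logic testable without a running InfluxDB.
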
There's no appreciable CPU load for doing this and we're able to write directly to InfluxDB, thus eliminating the need for log shippers to a central server, as was required with Logstash.

I have three requests:

1) I would greatly appreciate having a native InfluxDB output plugin included in future releases of Heka, and would like to contribute my work for your review and consideration. Whether separate plugins exist for ElasticSearch and InfluxDB or a generalized BatchHttpOutput plugin emerges is worth considering. The differences between the ElasticSearchOutput plugin and my modified InfluxDB plugin are minimal. First, the ElasticSearch plugin assumes a fixed endpoint (/_bulk) whereas InfluxDB relies on a query string. Second, ElasticSearch returns a JSON response whereas InfluxDB returns an HTTP status of 204 (No Content). Both ElasticSearch & InfluxDB support TLS & UDP, though I've not tested either of these features with InfluxDB. Differences beyond these are minor.

2) I've found one problem with my output plugin that appears unrelated to my changes or to InfluxDB, and most likely exists for ElasticSearch as well. When using the LogstreamerInput to read a single file and my InfluxOutput (cf. ElasticSearchOutput) with 'use_buffering = true', everything works fine. When using the LogstreamerInput to read multiple files with a file match pattern/priority, I have to turn off buffering or I receive the following errors:

2015/08/24 14:56:43 Diagnostics: 1 packs have been idle more than 120 seconds.
2015/08/24 14:56:43 Diagnostics: (input) Plugin names and quantities found on idle packs:

From a previous discussion, it would appear a deadlock is occurring. Please advise on how to debug this further.

3) While attempting to develop customizations via the Lua Sandbox, the only practical logging facility I could use was add_to_payload(). But that was out of scope from lua_modules/.
I would like to know how best to relax the sandbox restrictions to gain access to the Lua IO library so that I can capture output to stdio/log files. Or, in general, what advice do you offer on how best to develop/debug code via the Lua Sandbox?

Thanks,
Chris

