Funny you should ask about Splunk. This is to *replace* Splunk, because the licensing fees are getting so high that our IT pulled the plug on DEV instances going to Splunk. I'm prototyping a "Splunk lite" using an existing licensed ML server to see if it would work.
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Kelly Stirman
Sent: Thursday, June 30, 2011 9:31 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] "Minimum" overhead for loading documents

Hi David,

Have you looked at Splunk?

I would disable

-inherit permissions
-inherit collections
-inherit quality
-maintain last modified
-directory maintain last modified

And set directory creation to manual.

If you disable all the database indexes and load your log files as marked up documents, you still have indexes for:

-uri
-structure
-element and attribute values (but not individual tokens)
-collections
-security
-and a few other things I'm probably overlooking

I don't know if this satisfies your query requirements. You could configure word query to selectively add indexes for specific elements, and even weight them accordingly. You could enable a few range indexes, like the timestamp, to enable sorting and analytics.

All of these changes will improve ingestion throughput at the MarkLogic level, and you'll need to pay close attention to disk I/O. You can run some tests with this configuration to determine the throughput of an individual machine, and extrapolate from there how many forests and servers you need to support your desired ingestion rate. You can report back on your findings and we can help you with sizing.

This might be an application with broad appeal - perhaps others can pitch in on parsers for the different log formats (multi-line logs seem rather tricky).

If you want to prune days of data out of the database, I would store each day as a collection and use collection delete. If you have triggers disabled, and the settings above, MarkLogic performs collection deletes very efficiently.

Kelly

Message: 4
Date: Wed, 29 Jun 2011 23:06:52 +0000
From: "Lee, David" <[email protected]>
Subject: [MarkLogic Dev General] "Minimum" overhead for loading documents
To: "General Mark Logic Developer Discussion ([email protected])" <[email protected]>
Message-ID: <31395bf86e0a454f832b8f8824ed6bda02e...@exmb-pp03.corp.epocrates.com>
Content-Type: text/plain; charset="us-ascii"

I'm considering using MarkLogic as a log file analyzer. This means storing possibly hundreds of GB of fairly flat structure (think log4j, Apache, and Tomcat output).

Why use ML at all when maybe something like MySQL would be better? I think the dateTime indexing and free-form text searching would be extremely valuable. Also, although the raw data is "flat", the resultant analyzed data may well be hierarchical (imagine creating a call stack diagram from log files). I think this is a perfect use of ML and XQuery (but I may be insane).

With that in mind, I'm curious how to make this efficient in time & space. If I make a new forest & database just for this ... what minimizes the time to load and key space? My *guess* is to minimize all the parameters in the database affecting indexing to the bare minimum, possibly none except for explicit indexes ... or maybe a simple word index. But would love opinions.

Data rate I'm looking at is approx. 10GB/day - continuously ... and likely may need to archive off anything over a few days old (so 50GB might be a reasonable max storage). I'd like the data to be fed in realtime and not require a rack of 100 servers to do it ...

Ideas welcome! (including "you're insane, just use another tool").

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
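
For anyone following the sizing and pruning advice in this thread, here is a rough Python sketch of the arithmetic involved. It is only an illustration under stated assumptions, not anything from MarkLogic itself: the per-host throughput figure, the headroom multiplier, and the "logs/" collection prefix are all hypothetical placeholders, and the per-host number has to come from your own single-machine test. The 10GB/day figure and the few-days retention window are from the original post. 10GB/day works out to roughly 124 kB/s on average, so the interesting number is the peak rate, not the mean.

```python
import math
from datetime import date, timedelta
from typing import List

BYTES_PER_GB = 1024 ** 3
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_rate(gb_per_day: float) -> float:
    """Average bytes per second needed to keep up with a continuous feed."""
    return gb_per_day * BYTES_PER_GB / SECONDS_PER_DAY


def hosts_needed(required_bps: float, measured_bps_per_host: float,
                 headroom: float = 2.0) -> int:
    """Extrapolate a host count from a single-host load test.

    measured_bps_per_host is whatever your own test run showed (hypothetical
    here); headroom is an assumed multiplier for bursts above the average.
    """
    return max(1, math.ceil(required_bps * headroom / measured_bps_per_host))


def day_collection(day: date, prefix: str = "logs/") -> str:
    """Collection name for one day of logs. The prefix is a made-up convention."""
    return prefix + day.isoformat()


def expired_collections(today: date, retention_days: int, oldest: date,
                        prefix: str = "logs/") -> List[str]:
    """Names of the day collections that fall outside the retention window.

    The window keeps `retention_days` days counting today, so anything dated
    before `today - (retention_days - 1)` is returned, oldest first.
    """
    cutoff = today - timedelta(days=retention_days - 1)
    expired = []
    day = oldest
    while day < cutoff:
        expired.append(day_collection(day, prefix))
        day += timedelta(days=1)
    return expired


if __name__ == "__main__":
    rate = average_ingest_rate(10)  # 10GB/day, from the original post
    print("average rate (bytes/sec):", round(rate))
    # 1 MB/s per host is a placeholder, not a measured MarkLogic number.
    print("hosts:", hosts_needed(rate, measured_bps_per_host=1_000_000))
    print("to delete:", expired_collections(date(2011, 6, 30), 5, date(2011, 6, 24)))
```

Each name returned by expired_collections would then be handed to a collection delete on the MarkLogic side (xdmp:collection-delete in XQuery), one call per expired day. Keeping the naming and the cutoff logic outside the database makes it easy to dry-run the pruning before anything is actually deleted.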
