I don't know about MySQL, but you might want to investigate other NoSQL
options. While I'm sure you can get MarkLogic (itself a NoSQL option) to work
here, I really wonder if it is your best choice. On the surface, this sounds
like a more straightforward key-value problem (as in Cassandra) or a very
basic document problem (think MongoDB). If you search around, you will find
many examples of people who have used MongoDB explicitly for this purpose. In
addition, depending on the type of analysis you plan on doing, I assume you
might need some Hadoop jobs as well. There is currently built-in support for
Hadoop in both Cassandra and MongoDB ... I believe a Hadoop connector will be
coming in the next big release of MarkLogic (at least that was my
understanding from the user conference).
Darin.
________________________________
From: "Lee, David" <[email protected]>
To: "General Mark Logic Developer Discussion ([email protected])"
<[email protected]>
Sent: Wednesday, June 29, 2011 7:06 PM
Subject: [MarkLogic Dev General] "Minimum" overhead for loading documents
I'm considering using MarkLogic as a log file analyzer.
This means storing possibly hundreds of GB of fairly flat structure (think
log4j, Apache, and Tomcat output).
Why use ML at all when maybe something like MySQL would be better?
I think the dateTime indexing and free-form text searching would be extremely
valuable.
Also, although the raw data is "flat", the resulting analyzed data may well be
hierarchical (imagine creating a call stack diagram from log files). I think
this is a perfect use of ML and XQuery (but I may be insane).
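
To make the "flat in, hierarchical out" idea concrete, here is a minimal sketch in Python (not XQuery) of building a nested call tree from flat log lines. The `ENTER name` / `EXIT name` event format and the `build_call_tree` helper are assumptions made up for illustration; real log4j or Tomcat output would need its own parsing step first.

```python
# Sketch: turn flat ENTER/EXIT log events into a nested call tree.
# The "ENTER name" / "EXIT name" line format is hypothetical.

def build_call_tree(lines):
    """Return a nested dict tree built from a flat sequence of log lines."""
    root = {"name": "<root>", "children": []}
    stack = [root]  # stack[-1] is the call currently in progress
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        event, name = parts
        if event == "ENTER":
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif event == "EXIT" and len(stack) > 1 and stack[-1]["name"] == name:
            stack.pop()  # only close the call that is actually open
    return root


if __name__ == "__main__":
    sample = [
        "ENTER main",
        "ENTER parse",
        "EXIT parse",
        "ENTER render",
        "EXIT render",
        "EXIT main",
    ]
    tree = build_call_tree(sample)
    print(tree["children"][0]["name"])
```

The resulting tree maps naturally onto a nested document, which is the kind of structure the analysis step would store.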
With that in mind I'm curious how to make this efficient in time & space.
If I make a new forest & database just for this ... what minimizes the time to
load and key space?
My *guess* is to minimize all the parameters in the database affecting indexing
to the bare minimum, possibly none except for explicit indexes ... or maybe a
simple word index. But would love opinions.
Data rate I'm looking at is approx. 10GB/day - continuously ... and likely may
need to archive off anything over a few days old (so 50GB might be a reasonable
max storage).
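
As a rough check on those figures, here is a back-of-the-envelope sketch in Python. It assumes decimal gigabytes, a perfectly even ingest rate, and a 5-day retention window standing in for "a few days"; none of these are stated above, and index overhead is ignored entirely.

```python
# Back-of-the-envelope sizing for the quoted data rate.
# Assumptions (not from the original post): decimal units, an even ingest
# rate across the day, and a 5-day retention window for "a few days".

SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_kb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in kilobytes per second for a daily volume in GB."""
    return gb_per_day * 1_000_000 / SECONDS_PER_DAY


def retained_gb(gb_per_day: float, retention_days: int) -> float:
    """Raw data held at steady state, before any index overhead."""
    return gb_per_day * retention_days


if __name__ == "__main__":
    print(f"{sustained_rate_kb_per_s(10):.1f} KB/s average ingest")
    print(f"{retained_gb(10, 5):.0f} GB retained")
```

Under those assumptions the average rate is on the order of a hundred kilobytes per second, so the raw volume alone is modest; real logs tend to be bursty, so peak rates could be much higher.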
I'd like the data to be fed in real time and not require a rack of 100 servers
to do it ...
Ideas welcome! (including "you're insane, just use another tool").
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general