Re: [MarkLogic Dev General] "Minimum" overhead for loading documents

Darin McBeath Thu, 30 Jun 2011 07:08:10 -0700

I don't know about MySQL, but you might want to investigate other NoSQL 
options.  While I'm sure you can get MarkLogic (itself a NoSQL option) to work 
here, I really wonder if it is your best choice.  On the surface, this sounds 
like a more straightforward key-value problem (as in Cassandra) or a very basic 
document problem (think MongoDB).  If you do some Google searches, you will 
find many examples of people who have used MongoDB explicitly for this 
purpose.  In addition, depending on the type of analysis you plan on doing, I 
assume you might need some Hadoop jobs as well.  There is currently built-in 
support for Hadoop in both Cassandra and MongoDB ... I believe a Hadoop 
connector will be coming in the next big release of MarkLogic (at least that 
was my understanding from the user conference).

Darin.

________________________________
From: "Lee, David" <[email protected]>
To: "General Mark Logic Developer Discussion ([email protected])" 
<[email protected]>
Sent: Wednesday, June 29, 2011 7:06 PM
Subject: [MarkLogic Dev General] "Minimum" overhead for loading documents

I'm considering using MarkLogic as a log file analyzer.
This means storing possibly hundreds of GB of fairly flat data (think log4j, 
Apache, and Tomcat output).
Why use ML at all when maybe something like MySQL would be better?
I think the dateTime indexing and free-form text searching would be extremely 
valuable.
Also, although the raw data is "flat", the resulting analyzed data may well be 
hierarchical (imagine creating a call stack diagram from log files).  I think 
this is a perfect use of ML and XQuery (but I may be insane).

With that in mind I'm curious how to make this efficient in time & space.
If I make a new forest & database just for this ... what minimizes the time to 
load and key space?
My *guess* is to reduce all the database parameters affecting indexing to the 
bare minimum, possibly none except for explicit indexes ... or maybe a simple 
word index.  But I would love opinions.

The data rate I'm looking at is approx. 10GB/day, continuously ... and I will 
likely need to archive off anything over a few days old (so 50GB might be a 
reasonable max storage).

I'd like the data to be fed in realtime and not require a rack of 100 servers 
to do it ...

Ideas welcome! (including "you're insane, just use another tool").

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general

