I have tried all of those systems, and with the amount of data we deal with, they don't hold up to what we need. I have already designed two log systems, and this will hopefully be my third and final implementation, this time a distributed version of my previous work.

It does take time to develop things, but for the amount of time I would spend developing this, it's well worth it. I have tried other free systems like ElasticSearch and the like, but they all fall short. My software, running on a dual Intel Xeon [email protected] with 32 HDDs in hardware RAID (5 or 6, I forget which), can process about 10,000-20,000 entries per second while keeping the five-minute load average at about 2.00-5.00 and disk utilization around 50-70%. With other systems like ElasticSearch, the server hits a load average of over 50.00 and 100% disk utilization, which makes most of those solutions inadequate for what I am trying to do.

I also need something that lets me adjust log formats on the fly: adding new ones, detecting entries that match no pattern, and so on. Splunk is the only thing I have seen that can properly handle the amount of log data I am dealing with, in terms of both indexing and searching, but it is just extremely expensive. I have no idea how they get away with such a ridiculous pricing scheme; I have talked to their developers at conferences and even they agreed the prices are crazy. For the price of just one year's subscription fee, I can and will pay myself to build something that does roughly the same job, and buy the hardware to start the clusters.

Thanks,
Colton McInroy

 * Director of Security Engineering

        
Phone (Toll Free)
_US_    (888)-818-1344 Press 2
_UK_    0-800-635-0551 Press 2

My Extension    101
24/7 Support    [email protected]
Email   [email protected]
Website         http://www.dosarrest.com

On 10/1/2013 6:59 PM, Otis Gospodnetic wrote:
Hi,

I'm joining this very informative thread a little late. Splunk is
pricey.  But why not go for Loggly or something along those lines?
Building a good, scalable, fast logging system with more than just
basic features and with a nice UI takes time, and time is priceless.
If it's of interest, we just released Logsene internally here at
Sematext ( http://sematext.com/logsene/ ) and have Kibana running on
top of it.  May be something to consider if you want a logging system
todayish.

Back to  observer mode...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <[email protected]> wrote:
Thanks for the info, appreciate it.

Yes, I have looked at Splunk... for the amount of data I am dealing with, they want $500,000 outright, with a $200,000 yearly subscription. As we scale up, the price goes up as well; their pricing scheme is absolutely nuts. I figure that for less than the outright cost, I could build myself a Hadoop cluster and a Blur cluster, and build the software myself to do just as good a job as Splunk, if not better. That's why I am working on this project right now. I made something with Lucene itself, but to ensure redundancy and scalability, I wanted to go with something distributed. Before I started developing my own project from scratch, I took a look at what was already out there for merging Hadoop and Lucene, which is what led me to Blur. I am hoping that using Blur with Hadoop will allow me to manage the large amount of log data I have constantly flowing in.

Right now it's all on one server with over 100 TB of disk space and plenty of RAM and cores, but as I watch the load on the system right now, I realize that at some point a single server just isn't going to cut it. At some point the level of data will exceed what any single hardware box can do.

Since the data is log entries, my goal is to store and index everything that comes in, in real time, and make it easily searchable, while using facet data to obtain metrics, all distributed across a redundant infrastructure.

My plan is to use this information to correlate events that occur across large timespans, such as comparing this month's traffic levels with the same month last year, or seeing which servers an IP has accessed over the past year.

From what I can see so far, I will be building a Hadoop cluster to store the data, a Blur cluster to process it, and a parser that takes in data in various formats and passes it off to Blur. Then I will have a client that handles search queries against it... which actually brings up another question: if I parse data one way, but later craft a new parser, how well does Blur handle changing records? Say I do not have a parser that handles a particular log entry, so that line ends up being stored with its entire contents in a single "message" field. Later I figure out a way to parse that line into custom fields. Does the mutation system work well when manipulating a lot of records, say a month's or even a year's worth of entries matching a certain query?
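
To make the fallback behaviour concrete, here is a simplified sketch of what I mean (the pattern and field names are just illustrative, not my actual parser):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {

    // Example pattern for a simple "key=value key=value" style entry; a real
    // deployment would register one pattern per known log format.
    private static final Pattern KV_PAIRS = Pattern.compile("(\\w+)=(\\S+)");

    // Try to break a raw log line into named fields. If no known pattern
    // applies, fall back to a single "message" field holding the whole line,
    // which can be re-parsed (and the stored record mutated) later.
    public static Map<String, String> parse(String rawLine) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        Matcher m = KV_PAIRS.matcher(rawLine);
        boolean matchedAny = false;
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
            matchedAny = true;
        }
        if (!matchedAny) {
            // Unknown format: keep the raw text so nothing is lost.
            fields.put("message", rawLine);
        }
        return fields;
    }
}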


Thanks,
Colton McInroy

  * Director of Security Engineering



On 9/28/2013 5:29 AM, Garrett Barton wrote:
Going back over a certain time was just a suggestion based on a guess at your query needs. Personally I would go for one large index to begin with; if performance ever became a problem and I could not add a few more nodes to the cluster, I would then consider splitting the index.

In the next month or so I will be prototyping 2 very interesting indexes. One will be a massive, massive full-text index that I plan on bulk loading into with MR running every hour. The other has to be able to load several TB/hour, and I will be trying that on a much shorter MR schedule, say every 5-10 minutes. I expect both to work fine.

I don't think it's that far of a stretch for you to go to the minute level like you have today with the MR approach, or hell, try the Thrift API; with enough nodes I bet it would handle that kind of load as well.

Just slightly off topic, have you looked at Splunk? It does what you're trying to do out of the box.


On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <[email protected]> wrote:

I actually didn't have any kind of loading interval; I loaded new event log entries into the index in real time. My code runs as a daemon accepting syslog entries and indexes them live as they come in, with a flush call every 10,000 entries or every minute, whichever comes first.
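
The flush logic is basically count-or-time batching around Lucene's commit; a stripped-down sketch of the idea (not my actual daemon code) looks like this:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Flush every MAX_PENDING entries or every FLUSH_INTERVAL_MILLIS,
// whichever comes first.
public class BatchingIndexer {

    private static final int MAX_PENDING = 10000;
    private static final long FLUSH_INTERVAL_MILLIS = 60000L;

    private final IndexWriter writer;
    private int pending = 0;
    private long lastFlush = System.currentTimeMillis();

    public BatchingIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    public synchronized void index(String rawLine) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("message", rawLine, Field.Store.YES));
        writer.addDocument(doc);
        pending++;

        long now = System.currentTimeMillis();
        if (pending >= MAX_PENDING || now - lastFlush >= FLUSH_INTERVAL_MILLIS) {
            writer.commit();   // make the batch visible to newly opened readers
            pending = 0;
            lastFlush = now;
            // (a real daemon would also flush from a timer thread when idle)
        }
    }
}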

And I don't want any limitation on lookback time. I want to be able to look at the history of any site going back years if need be.

It sucks that there is no multi-table reader; that limits what I can do a bit.


Thanks,
Colton McInroy

   * Director of Security Engineering



On 9/28/2013 12:48 AM, Garrett Barton wrote:

MapReduce is a bulk entry point for loading Blur. Much in the same way I bet you have some fancy code to grab up a bunch of log files over some interval and load them into your index, MR replaces that process with an auto-scaling (via hardware additions only), high-bandwidth load that you can fire off at any interval you want. The MR bulk load writes a new index and merges it into the index already running when it completes. The catch is that it is NOT as efficient as your implementation in terms of latency into the index. So where your current impl will load a small site's couple of MB really fast, MR might take 30 seconds to a minute to bring that online. Having said that, Blur has a realtime API for inserting that has low latency, but you trade away your high bandwidth for it. It might be something you could detect on your front door and decide which way in the data comes.
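
To sketch what I mean by deciding at the front door (the two sinks here are just placeholders, not Blur's actual API):

import java.util.List;

// Route each batch to the low-latency path or queue it for the next bulk
// MapReduce load. The sinks are placeholders for whatever realtime/bulk
// ingestion you wire up.
public class IngestRouter {

    public interface Sink {
        void accept(List<String> entries);
    }

    private static final int BULK_THRESHOLD = 50000; // tune for your cluster

    private final Sink realtimeSink; // e.g. wraps the realtime insert path
    private final Sink bulkSink;     // e.g. stages files for the next MR load

    public IngestRouter(Sink realtimeSink, Sink bulkSink) {
        this.realtimeSink = realtimeSink;
        this.bulkSink = bulkSink;
    }

    public void route(List<String> batch) {
        if (batch.size() < BULK_THRESHOLD) {
            realtimeSink.accept(batch);  // small and latency-sensitive
        } else {
            bulkSink.accept(batch);      // big and bandwidth-sensitive
        }
    }
}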

When I was in your shoes, highly optimizing your indexes based on size and load for a single badass machine and doing manual partitioning tricks to keep things snappy was key. The neat thing about Blur is that you don't do some of that anymore. I would call it an early optimization at this point to do anything shorter than, say, a day or whatever your max lookback time is. (Oh, by the way, you can't search across tables in Blur; forgot to mention that.)

Instead of the lots-of-tables route, I suggest trying one large table and seeing where that goes. Utilize Blur's cache-initializing capabilities and load in your site and time columns to keep your logical partitioning columns in the block cache and thus very fast. I bet you will see good performance with this approach. Certainly better than ES. Not as fast as raw Lucene, but there is always a price to pay for distributing, and so far Blur has the lowest overhead I've seen.

Hope that helps some.
On Sep 27, 2013 11:31 PM, "Colton McInroy" <[email protected]> wrote:

Comments inline...
Thanks,
Colton McInroy

    * Director of Security Engineering



On 9/27/2013 5:02 AM, Aaron McCurry wrote:

   I have commented inline below:

On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <[email protected]> wrote:
I do have a few questions if you don't mind... I am still trying to wrap my head around how this works. In my current implementation for a logging system, I create new indexes for each hour because I have a massive amount of data coming in. I take in live log data from syslog and parse/store it in hourly Lucene indexes along with a facet index. I want to turn this into a distributed, redundant system, and Blur appears to be the way to go. I tried ElasticSearch, but it is just too slow compared to my current implementation. Given that I take in gigs of raw log data an hour, I need something that is robust and able to keep up with the inflow of data.

Due to your current implementation of building up an index for an hour and then making it available, I would use MapReduce for this:

http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce

That way all the shards in a table get a little more data each hour and it's very low impact on the running cluster.

Not sure I understand this. I would like data to be accessible live as it comes in, not wait an hour before I can query against it. I am also not sure where MapReduce comes in here; I thought MapReduce was something that Blur used internally.

When taking in lots of data constantly, how is it recommended that it be stored? I mentioned above that I create a new index for each hour to keep data separated and quicker to search. If I want to look at a specific time frame, I only have to load the directories timestamped with the hours I want to look at. So instead of having to search a huge index with, say, a year's worth of data, I'm looking at a much smaller data set, which results in faster query response times. Should a new table be created for each hour of data? When I typed the create command into the shell, it took about 6 seconds to create a table. If I have to create a table for each application each hour, this could create a lot of lag. Perhaps this is just my test environment, though. Any thoughts on this? I also didn't see any examples of how to create tables via code.

First off, Blur is designed to store very large amounts of data, and while it can do NRT updates like Solr and ES, its main focus is on bulk ingestion through MapReduce. Given that, the real limiting factor is how much hardware you have. Let's play out a scenario. Say you are adding 10GB of data an hour; a good rough ballpark guess is that you will need 10-15% of the inbound data size as memory to make searches perform well, though as index sizes increase this percentage may decrease over time. Blur has an off-heap LRU cache to make accessing HDFS faster; if you don't have enough memory, the searches (and the cluster, for that matter) won't fail, they will simply become slower.
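
To put rough numbers on that rule of thumb: at 10GB an hour you are looking at roughly 240GB a day, so 10-15% works out to something like 24-36GB of memory per day's worth of data you expect to search frequently.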

So it's really a question of how much hardware you have. If you fill a table beyond the point where it performs well given the cluster you have, you might have to break it into pieces. But I think that hourly is too small. Daily, weekly, monthly, etc.

In the current system I designed (which uses just Lucene), we take in mainly web logs and separate them into indexes. Each web server gets its own index for each hour. Then when I need to query the data, I use a multi-index reader over the timeframe I need, which keeps the size of the index down to roughly what I need to search. If data were stored over a month and I want to query what happened in a single hour, or a few minutes, it makes sense to me to keep things optimized that way. Also, if I wanted to compare one web server to another, I would just use the multi-index reader to load both indexes. This is all handled by a single server, though, so it is limited by that server's hardware. If something fails, it's a big problem, and when querying large data sets it's again only a single server, so it takes longer than I would like if the index it's reading is large.
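
For reference, the single-server version of that is basically Lucene's MultiReader over the hourly directories, something like this simplified sketch (Lucene 4.x style; the directory naming is illustrative):

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Open only the hourly indexes covering the requested timeframe and
// search them as one logical index.
public class HourlySearch {

    public static IndexSearcher open(File baseDir, List<String> hourDirs)
            throws IOException {
        List<IndexReader> readers = new ArrayList<IndexReader>();
        for (String hour : hourDirs) {               // e.g. "2013_09_28_05"
            File indexDir = new File(baseDir, hour);
            readers.add(DirectoryReader.open(FSDirectory.open(indexDir)));
        }
        IndexReader combined =
                new MultiReader(readers.toArray(new IndexReader[0]));
        return new IndexSearcher(combined);
    }
}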
I am not entirely sure how to go about doing this in Blur. I'm imagining that each "table" is an index, so I would have a table naming scheme like YYYY_MM_DD_HH_IP. If I do this, though, is there a way to query multiple tables, like a multi-table reader or something? Or am I limited to looking at a single table at a time?

For some web servers with little traffic, an hour of data may only be a few MB, while for others it may be a 5-10GB index. If I combined the index from a large site with the small sites, would that make queries against the small sites' data slower? Or would it all be the same because of how Blur separates indexes into shards? Would it perhaps be better to have an index for each web server, and configure small sites with fewer shards and larger sites with more?

We just got a really large, powerful new server to be our log server, but since I realize it's a single point of failure, I want to move to a clustered/distributed configuration. We would start with a minimal configuration and add more shard servers whenever we can afford it or need to.

Do shards contain the index data while the location (HDFS) contains the documents (as Lucene refers to them)? I read that the shard contains the index while the filesystem contains the data... I just wasn't quite sure what the data was, because when I work with Lucene, the index directory contains the data as documents.

The shard is stored in HDFS, and it is a Lucene index. We store the data inside the Lucene index, so it's basically Lucene all the way down to HDFS.

Ok, so basically a controller is a service which sends a distributed query to all (or some?) of the shards; each shard runs the query against a certain data set, which it gets either from memory or from the Hadoop cluster, processes it, and returns the result to the controller, which condenses the results from all the queried shards into a final result. Right?
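
Conceptually I picture that condensing step as a scatter-gather merge of each shard's top hits, something like this (purely illustrative, not Blur's actual code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Combine each shard's top hits into a single global top-k list,
// ordered by descending score.
public class TopKMerge {

    public static class Hit {
        public final String rowId;
        public final float score;
        public Hit(String rowId, float score) { this.rowId = rowId; this.score = score; }
    }

    public static List<Hit> merge(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);                   // gather results from every shard
        }
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);  // highest score first
            }
        });
        return all.subList(0, Math.min(k, all.size())); // keep the global top k
    }
}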

   Hope this helps.  Let us know if you have more questions.
Thanks,
Aaron



   Thanks,
Colton McInroy

     * Director of Security Engineering





