Hi, I'm joining this very informative thread a little late. Splunk is pricey. But why not go for Loggly or something along those lines? Building a good, scalable, fast logging system with more than just basic features and with a nice UI takes time, and time is priceless. If it's of interest, we just released Logsene internally here at Sematext ( http://sematext.com/logsene/ ) and have Kibana running on top of it. May be something to consider if you want a logging system todayish.
Back to observer mode...

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Sat, Sep 28, 2013 at 9:13 AM, Colton McInroy <[email protected]> wrote:
> Thanks for the info, appreciate it.
>
> Yes, I have looked at splunk... for the amount of data I am dealing with, they want $500,000 outright, with a $200,000 yearly subscription. As we scale up, the price goes up as well; their pricing scheme is absolutely nuts. I figure that for less than the outright cost, I could build myself a hadoop cluster and a blur cluster, and write the software myself to do just as good a job as splunk, if not better. That's why I am working on this project right now. I made something with lucene itself, but to ensure redundancy and scalability I wanted to go with something distributed. Before I started developing my own project from scratch, I took a look at what was already out there for merging hadoop and lucene, which is what led me to blur. I am hoping that using blur with hadoop will allow me to manage the large amount of log data I have constantly flowing in.
>
> Right now it's all on one server that has over 100TB of disk space and lots of RAM and cores, but as I watch the load on the system right now, I realize that at some point a single server just isn't going to cut it. At some point the level of data will go above what any single hardware box can handle.
>
> The data being log entries, my goal is to store/index all log data in real time as it comes in and make it easily searchable, while using facet data to obtain metrics, all on a redundant, distributed infrastructure.
>
> My plan is to use this information to correlate things that occur across large timespans: like seeing how traffic levels during this month last year compare to the same month this year, or seeing which servers an IP has accessed over the past year, etc.
>
> From what I can see so far, I will be building a hadoop cluster to store the data, a blur cluster to process the data, and then a parser which takes in data in various formats and passes it off to blur. Then I will have a client which handles the search queries against it... which actually brings up another question... If I parse data one way, but then craft a new parser, how well does blur handle changing records? Say I do not have a parser that handles a particular log entry, so that line ends up being stored as just a single "message" field containing the raw data. But later I figure out a way to parse that line into custom fields. Does the mutation system work well when manipulating a lot of records... say going over a month's, or even a year's, worth of entries matching a certain query?
>
> Thanks,
> Colton McInroy
>
> * Director of Security Engineering
> Phone (Toll Free): US (888)-818-1344 Press 2, UK 0-800-635-0551 Press 2
> My Extension: 101
> 24/7 Support: [email protected]
> Email: [email protected]
> Website: http://www.dosarrest.com
>
> On 9/28/2013 5:29 AM, Garrett Barton wrote:
>> The going back over a certain time was just a suggestion based on a guess at your query needs. Personally I would go for one large index to begin with; if performance ever became a problem and I could not add a few more nodes to the cluster, I would then consider splitting the index.
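(A note on the mutation question at the end of Colton's message: in Blur, a re-parse pass like the one he describes would be expressed as row/record mutations through the Thrift client. A minimal sketch, assuming the 0.2.x generated Thrift types (RowMutation, RecordMutation, RecordMutationType); the table name, row/record ids, and column values are illustrative only:

    import java.util.Arrays;

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.Column;
    import org.apache.blur.thrift.generated.Record;
    import org.apache.blur.thrift.generated.RecordMutation;
    import org.apache.blur.thrift.generated.RecordMutationType;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.blur.thrift.generated.RowMutationType;

    public class ReparseSketch {
      public static void main(String[] args) throws Exception {
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        // A record that was originally stored as one raw "message" column,
        // now re-parsed into structured columns (ids/values illustrative).
        Record parsed = new Record();
        parsed.setRecordId("rec-123");   // reuse the raw record's id
        parsed.setFamily("access");
        parsed.setColumns(Arrays.asList(
            new Column("ip", "203.0.113.7"),
            new Column("status", "200"),
            new Column("path", "/index.html")));

        // Replace the old raw record with the parsed one.
        RecordMutation recordMutation = new RecordMutation();
        recordMutation.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD);
        recordMutation.setRecord(parsed);

        // Apply it as an update to the row that holds the record.
        RowMutation mutation = new RowMutation();
        mutation.setTable("logs");               // illustrative table name
        mutation.setRowId("host1-2013-09-28");   // illustrative row id
        mutation.setRowMutationType(RowMutationType.UPDATE_ROW);
        mutation.setRecordMutations(Arrays.asList(recordMutation));

        client.mutate(mutation);
      }
    }

For a month's or a year's worth of matching records you would drive this from a query, or rewrite the table in bulk with MapReduce, rather than mutate record by record over Thrift; whether per-record mutation stays fast at that volume is exactly the open question above.)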
>> In the next month or so I will be prototyping 2 very interesting indexes. One will be a massive, massive full-text index that I plan on bulk-loading into with MR runs every hour. The other has to be able to load several TB/hour, and I will be trying that on a much shorter MR schedule, say every 5-10 minutes. I expect both to work fine.
>>
>> I don't think it's that far of a stretch for you to get to the minute level like you have today with the MR approach, or, hell, try the thrift api; with enough nodes I bet it would handle that kind of load as well.
>>
>> Just slightly off topic, have you looked at splunk? It does what you're trying to do out of the box.
>>
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <[email protected]> wrote:
>>> I actually didn't have any kind of loading interval; I loaded the new event log entries into the index in real time. My code runs as a daemon accepting syslog entries, indexing them live as they come in, with a flush call every 10,000 entries or 1 minute, whichever comes first.
>>>
>>> And I don't want to have any limitation on lookback time. I want to be able to look at the history of any site going back years if need be.
>>>
>>> Sucks that there is no multi-table reader; that limits what I can do a bit.
>>>
>>> Thanks,
>>> Colton McInroy
>>>
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>> Mapreduce is a bulk entrypoint for loading blur. Much in the same way that I bet you have some fancy code to grab up a bunch of log files over some kind of interval and load them into your index, MR replaces that process with an auto-scaling (via hardware additions only), high-bandwidth load that you can fire off at any interval you want. The MR bulk load writes a new index and merges it into the index already running when it completes. The catch is that it is NOT as efficient as your implementation in terms of latency into the index. So where your current impl will load a small site's couple of MB real fast, MR might take 30 seconds to a minute to bring that online. Having said that, blur has a realtime api for inserting that has low latency, but you trade away your high bandwidth for it. Might be something you could detect at your front door, deciding which way the data comes in.
>>>>
>>>> When I was in your shoes, highly optimizing indexes based on size and load for a single badass machine and doing manual partitioning tricks to keep things snappy was key. The neat thing about blur is that some of that you don't do anymore. I would call it an early optimization at this point to do anything shorter than, say, a day, or whatever your max lookback time is. (Oh, btw, you can't search across tables in blur; forgot to mention that.)
>>>>
>>>> Instead of the lots-of-tables route, I suggest trying one large one and seeing where that goes. Utilize blur's cache-initializing capabilities and load in your site and time columns to keep your logical partitioning columns in the block cache and thus very fast. I bet you will see good performance with this approach. Certainly better than es. Not as fast as raw lucene, but there is always a price to pay for distributing, and so far blur has the lowest overhead I've seen.
>>>>
>>>> Hope that helps some.
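(The count-or-time flush Colton describes, 10,000 entries or 1 minute, whichever comes first, is a small piece of machinery. A minimal sketch of that trigger in plain Java; this is an illustrative class, not taken from his daemon, which the ingest loop would consult on every entry, calling IndexWriter.commit() when it fires:

    import java.util.concurrent.TimeUnit;

    /**
     * Count-or-time flush trigger: fires after maxEntries entries or
     * maxSeconds elapsed, whichever comes first. A real daemon would
     * also run a timer so an idle index still gets flushed.
     */
    public class FlushTrigger {
      private final long maxEntries;
      private final long maxNanos;
      private long count;
      private long lastFlush = System.nanoTime();

      public FlushTrigger(long maxEntries, long maxSeconds) {
        this.maxEntries = maxEntries;
        this.maxNanos = TimeUnit.SECONDS.toNanos(maxSeconds);
      }

      /** Call once per indexed entry; true means "flush now". */
      public synchronized boolean onEntry() {
        count++;
        long now = System.nanoTime();
        if (count >= maxEntries || now - lastFlush >= maxNanos) {
          count = 0;
          lastFlush = now;
          return true;
        }
        return false;
      }
    }

Usage would be along the lines of new FlushTrigger(10000, 60) and, per entry, if (trigger.onEntry()) writer.commit();.)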
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <[email protected]> wrote:
>>>>> Comments inline...
>>>>>
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>> I have commented inline below:
>>>>>>
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <[email protected]> wrote:
>>>>>>> I do have a few questions if you don't mind... I am still trying to wrap my head around how this works. In my current implementation of a logging system, I create new indexes for each hour because I have a massive amount of data coming in. I take in live log data from syslog and parse/store it in hourly lucene indexes along with a facet index. I want to turn this into a distributed, redundant system, and blur appears to be the way to go. I tried elasticsearch, but it is just too slow compared to my current implementation. Given that I take in gigs of raw log data an hour, I need something that is robust and able to keep up with the inflow of data.
>>>>>>
>>>>>> Given your current pattern of building up an index for an hour and then making it available, I would use MapReduce for this:
>>>>>>
>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>
>>>>>> That way all the shards in a table get a little more data each hour and it's very low impact on the running cluster.
>>>>>
>>>>> Not sure I understand this. I would like data to be accessible live as it comes in, not to wait an hour before I can query against it. I am also not sure where map-reduce comes in here. I thought mapreduce was something that blur used internally.
>>>>>
>>>>>>> When taking in lots of data constantly, how is it recommended that it be stored? I mentioned above that I create a new index for each hour to keep data separated and quicker to search. If I want to look up a specific time frame, I only have to load the directories timestamped with the hours I want to look at. So instead of having to look at a huge index of, say, a year's worth of data, I'm looking at a much smaller data set, which results in faster query response times. Should a new table be created for each hour of data? When I typed the create command into the shell, it took about 6 seconds to create a table. If I have to create a table for each application each hour, this could create a lot of lag. Perhaps this is just my test environment, though. Any thoughts on this? I also didn't see any examples of how to create tables via code.
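(On the "create tables via code" point: table creation goes through the same Thrift interface as everything else. A minimal sketch, assuming the 0.2.0 client API; the connection string, table name, shard count, and HDFS URI are illustrative:

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.TableDescriptor;

    public class CreateTableSketch {
      public static void main(String[] args) throws Exception {
        Blur.Iface client = BlurClient.getClient("controller1:40010");

        TableDescriptor td = new TableDescriptor();
        td.setName("logs_2013_09");      // illustrative table name
        td.setShardCount(16);            // more shards for heavier sites
        td.setTableUri("hdfs://namenode/blur/tables/logs_2013_09");

        client.createTable(td);
      }
    }

The roughly 6-second creation time Colton saw is one-time setup cost per table, which is another argument against creating a table per application per hour.)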
>>>>>> First off, Blur is designed to store very large amounts of data. And while it can do NRT updates like Solr and ES, its main focus is on bulk ingestion through MapReduce. Given that, the real limiting factor is how much hardware you have. Let's play out a scenario: if you are adding 10GB of data an hour, a good rough ballpark guess is that you will need 10-15% of the inbound data size as memory to make search perform well, though as the index sizes increase this percentage may decrease over time. Blur has an off-heap LRU cache to make accessing hdfs faster; however, if you don't have enough memory, the searches (and the cluster, for that matter) won't fail, they will simply become slower.
>>>>>>
>>>>>> So it's really a question of how much hardware you have, and of filling a table only to the point where it still performs well given the cluster you have. You might have to break it into pieces, but I think that hourly is too small. Daily, weekly, monthly, etc.
>>>>>
>>>>> In my current system (which uses just lucene), we take in mainly web logs and separate them into indexes. Each web server gets its own index for each hour. Then, when I need to query the data, I use a multi-index reader to access the timeframe I need, allowing me to keep the size of the index down to roughly what I need to search. If data is stored over a month and I want to query what happened in just a single hour, or a few minutes, it makes sense to me to keep things optimized. Also, if I wanted to compare one web server to another, I would just use the multi-index reader to load both indexes. This is all handled by a single server, though, so it is limited by that server's hardware. If something fails, it's a big problem. And when querying large data sets it's, again, only a single server, so it takes longer than I would like if the index it's reading is large.
>>>>>
>>>>> I am not entirely sure how to go about doing this in blur. I'm imagining that each "table" is an index, so I would have a table naming format like YYYY_MM_DD_HH_IP. If I do this, though, is there a way to query multiple tables... like a multi-table reader or something? Or am I limited to looking at a single table at a time?
>>>>>
>>>>> For some web servers that have little traffic, an hour of data may only amount to a few MB, while others may have something like a 5-10GB index. If I combined the index from a large site with the small sites, this would make queries against the small sites' index slower, correct? Or would it all be the same due to how blur separates indexes into shards? Would it perhaps be better to have an index for each web server, and configure small sites to have fewer shards while larger sites have more?
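(Colton's multi-index reader setup is the standard Lucene pattern, worth seeing next to Blur's one-big-table approach. A minimal sketch, assuming Lucene 4.x-era APIs and illustrative hourly directory paths:

    import java.io.File;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class HourlySearchSketch {
      public static void main(String[] args) throws Exception {
        // Open only the hourly indexes covering the timeframe of interest.
        IndexReader r1 = DirectoryReader.open(
            FSDirectory.open(new File("/logs/web1/2013_09_28_09")));
        IndexReader r2 = DirectoryReader.open(
            FSDirectory.open(new File("/logs/web1/2013_09_28_10")));

        // One logical view over both hours (or over two servers, for comparisons).
        MultiReader multi = new MultiReader(r1, r2);
        IndexSearcher searcher = new IndexSearcher(multi);
        // ... run queries against `searcher` ...
        multi.close();   // this constructor closes the sub-readers too
      }
    }

Blur has no equivalent of MultiReader across tables, which is why the thread keeps steering toward one large table with site/time columns kept hot in the block cache.)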
>>>>> We just got a really large, powerful new server to be our log server, but as I realize that it's a single point of failure, I want to change our setup to a clustered/distributed configuration. So we would start with probably a minimal configuration and add more shard servers whenever we can afford to or need to.
>>>>>
>>>>>>> Do shards contain the index data while the location (hdfs) contains the documents (what lucene referred to them as)? I read that the shard contains the index while the fs contains the data... I just wasn't quite sure what the data was, because when I work with lucene, the index directory contains the data as documents.
>>>>>>
>>>>>> The shard is stored in HDFS, and it is a Lucene index. We store the data inside the Lucene index, so it's basically Lucene all the way down to HDFS.
>>>>>
>>>>> Ok, so basically a controller is a service which sends all (or some?) of the shards a distributed query; that tells each shard to run the query against a certain data set, and the shard then gets that data set either from memory or from the hadoop cluster, processes it, and returns its result to the controller, which condenses the results from all the queried shards into a final result. Right?
>>>>>
>>>>>> Hope this helps. Let us know if you have more questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Aaron
>>>>>>
>>>>>>> Thanks,
>>>>>>> Colton McInroy
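(That scatter/gather description matches what a client sees: you only ever talk to a controller, which fans the query out to the shard servers and merges their results. A minimal query sketch, assuming the 0.2.0 Thrift API (that release uses SimpleQuery/setQueryStr; later 0.2.x releases rename these), with an illustrative table name and query string:

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.BlurQuery;
    import org.apache.blur.thrift.generated.BlurResult;
    import org.apache.blur.thrift.generated.BlurResults;
    import org.apache.blur.thrift.generated.SimpleQuery;

    public class QuerySketch {
      public static void main(String[] args) throws Exception {
        // Controllers fan the query out to the shard servers and merge
        // their results; the client never talks to shards directly.
        Blur.Iface client =
            BlurClient.getClient("controller1:40010,controller2:40010");

        BlurQuery blurQuery = new BlurQuery();
        SimpleQuery simpleQuery = new SimpleQuery();
        simpleQuery.setQueryStr("access.ip:203.0.113.7");  // illustrative field/value
        blurQuery.setSimpleQuery(simpleQuery);

        BlurResults results = client.query("logs", blurQuery);
        System.out.println("total: " + results.getTotalResults());
        for (BlurResult result : results.getResults()) {
          System.out.println(result);
        }
      }
    }

Either controller in the connection string can serve the request, which is also what removes the single point of failure Colton is worried about.)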
