On Sat, Sep 28, 2013 at 9:23 AM, Colton McInroy <[email protected]> wrote:
> So, basically you're suggesting I use this undocumented Bulk MapReduce
> method to add all of the data live as it comes in? Do you have an example
> or any information on how I would accomplish this? What I could do is
> have a flush period where, as the logs come in and get parsed, I build
> them up to like 10000 entries or a timed interval, then bulk load them
> into Blur.

I will add some documentation on how to use it and probably an example, but
I would try using the async client (maybe start with the regular client)
first to see if it can keep up. Just as an FYI, I found a bug in 0.2.0
mutateBatch that causes a deadlock. I will resolve it later today, but if
you try it out before 0.2.1 is released (a couple of weeks) you will likely
need to patch the code. Here's the issue:

https://issues.apache.org/jira/browse/BLUR-245

Aaron

> Thanks,
> Colton McInroy
>
> * Director of Security Engineering
>
> Phone
> (Toll Free)
> _US_ (888)-818-1344 Press 2
> _UK_ 0-800-635-0551 Press 2
>
> My Extension 101
> 24/7 Support [email protected]
> Email [email protected]
> Website http://www.dosarrest.com
>
> On 9/28/2013 6:06 AM, Aaron McCurry wrote:
>
>> So there is a method that is not documented that the Bulk MapReduce uses
>> that could fill the gap between MR and NRT updates. Let's say that
>> there's a table with 100 shards. In a given table on HDFS the path would
>> look like:
>> "/blur/tables/table12345/shard-000010/<the main index goes here>"
>>
>> Now the way MapReduce works is that it creates a sub directory in the
>> main index:
>> "/blur/tables/table12345/shard-000010/some_index_name.tmp/<new data here>"
>>
>> Once the index is ready to be committed, the writer is closed for the
>> new index and the subdir is renamed to:
>> "/blur/tables/table12345/shard-000010/some_index_name.commit/<new data here>"
>>
>> The act of having an index in the shard directory that ends with
>> ".commit" makes the shard pick up the index and do an index merge
>> through the writer.addDirectory(..) call. It checks for this every 10
>> seconds.
>>
>> While this is not really easy to integrate yet, I think that if I build
>> an Apache Flume integration it will likely make use of this feature, or
>> at least have an option to use it.
>>
>> As far as searching multiple tables, this has been asked for before, so
>> I think that is something we should add. It actually shouldn't be that
>> difficult.
>>
>> Aaron
>>
>> On Sat, Sep 28, 2013 at 5:59 AM, Colton McInroy <[email protected]> wrote:
>>
>>> I actually didn't have any kind of loading interval; I loaded the new
>>> event log entries into the index in real time. My code runs as a daemon
>>> accepting syslog entries, indexing them live as they come in, with a
>>> flush call every 10000 entries or 1 minute, whichever comes first.
>>>
>>> And I don't want to have any limitation on lookback time. I want to be
>>> able to look at the history of any site going back years if need be.
>>>
>>> Sucks there is no multi-table reader; that limits what I can do a bit.
>>>
>>> Thanks,
>>> Colton McInroy
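As a reference point, a minimal sketch of Aaron's "start with the regular
client" suggestion applied to a flush loop like the one Colton describes.
This assumes the 0.2.0 Thrift client; the BlurThriftHelper factory methods
follow the 0.2.0 using-blur docs, and the table, family, column names, and
batch size are illustrative only:

```java
import static org.apache.blur.thrift.util.BlurThriftHelper.*;

import java.util.ArrayList;
import java.util.List;

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;
import org.apache.blur.thrift.generated.RowMutation;

public class SyslogFeeder {
  private static final int FLUSH_SIZE = 10000; // mirrors the flush period described above

  private final Iface client = BlurClient.getClient("controller1:40010");
  private final List<RowMutation> batch = new ArrayList<RowMutation>();

  // Called for every parsed syslog entry.
  public void index(String site, String rowId, String message) throws Exception {
    batch.add(newRowMutation("logs", rowId,
        newRecordMutation("log", rowId,
            newColumn("site", site),
            newColumn("message", message))));
    if (batch.size() >= FLUSH_SIZE) {
      flush();
    }
  }

  // Also driven by a 1-minute timer, whichever comes first.
  public void flush() throws Exception {
    if (!batch.isEmpty()) {
      client.mutateBatch(batch); // beware the BLUR-245 deadlock in 0.2.0
      batch.clear();
    }
  }
}
```

The async client Aaron mentions would replace the blocking mutateBatch call
with a callback-style one, trading simplicity for throughput.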
>>> On 9/28/2013 12:48 AM, Garrett Barton wrote:
>>>
>>>> MapReduce is a bulk entrypoint to loading Blur. Much in the same way I
>>>> bet you have some fancy code to grab up a bunch of log files over some
>>>> kind of interval and load them into your index, MR replaces that
>>>> process with an auto-scaling (via hardware additions only)
>>>> high-bandwidth load that you could fire off at any interval you want.
>>>> The MR bulk load writes a new index and merges it into the index
>>>> already running when it completes. The catch is that it is NOT as
>>>> efficient as your implementation in terms of latency into the index.
>>>> So where your current impl will load a small site's couple of MB real
>>>> fast, MR might take 30 seconds to a minute to bring that online.
>>>> Having said that, Blur has a realtime API for inserting that has low
>>>> latency, but you trade in your high bandwidth for it. Might be
>>>> something you could detect on your front door and decide which way in
>>>> the data comes.
>>>>
>>>> When I was in your shoes, highly optimizing your indexes based on size
>>>> and load for a single badass machine and doing manual partitioning
>>>> tricks to keep things snappy was key. The neat thing about Blur is
>>>> that some of that you don't do anymore. I would call it an early
>>>> optimization at this point to do anything shorter than say a day, or
>>>> whatever your max lookback time is. (Oh, btw, you can't search across
>>>> tables in Blur, forgot to mention that.)
>>>>
>>>> Instead of the lots-of-tables route, I suggest trying one large one
>>>> and seeing where that goes. Utilize Blur's cache-initializing
>>>> capabilities and load in your site and time columns to keep your
>>>> logical partitioning columns in the block cache and thus very fast. I
>>>> bet you will see good performance with this approach. Certainly better
>>>> than ES. Not as fast as raw Lucene, but there is always a price to pay
>>>> for distributing, and so far Blur has the lowest overhead I've seen.
>>>>
>>>> Hope that helps some.
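The bulk path Garrett describes here is the ".commit" hand-off Aaron
outlined earlier in the thread. A minimal sketch of the final commit step,
assuming the new index has already been built (and its IndexWriter closed)
under a ".tmp" name inside the shard directory; the paths mirror Aaron's
example and error handling is omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitIndex {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at the cluster holding the Blur tables.
    FileSystem fs = FileSystem.get(new Configuration());

    // New index was written, and its writer closed, under the ".tmp" name.
    Path tmp = new Path("/blur/tables/table12345/shard-000010/some_index_name.tmp");
    Path commit = new Path("/blur/tables/table12345/shard-000010/some_index_name.commit");

    // The rename is the commit: the shard server scans for ".commit"
    // directories roughly every 10 seconds and merges them in through
    // writer.addDirectory(..), per Aaron's description above.
    if (!fs.rename(tmp, commit)) {
      throw new IllegalStateException("rename failed; index not committed");
    }
  }
}
```

The rename is what makes the hand-off safe: the shard server never observes
a partially written ".commit" directory.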
>>>> On Sep 27, 2013 11:31 PM, "Colton McInroy" <[email protected]> wrote:
>>>>
>>>>> Comments inline...
>>>>>
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>
>>>>> On 9/27/2013 5:02 AM, Aaron McCurry wrote:
>>>>>
>>>>>> I have commented inline below:
>>>>>>
>>>>>> On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <[email protected]> wrote:
>>>>>>
>>>>>>> I do have a few questions if you don't mind... I am still trying to
>>>>>>> wrap my head around how this works. In my current implementation
>>>>>>> for a logging system, I create new indexes for each hour because I
>>>>>>> have a massive amount of data coming in. I take in live log data
>>>>>>> from syslog and parse/store it in hourly Lucene indexes along with
>>>>>>> a facet index. I want to turn this into a distributed redundant
>>>>>>> system, and Blur appears to be the way to go. I tried Elasticsearch
>>>>>>> but it is just too slow compared to my current implementation.
>>>>>>> Given I take in gigs of raw log data an hour, I need something that
>>>>>>> is robust and able to keep up with the inflow of data.
>>>>>>
>>>>>> Due to the current implementation of building up an index for an
>>>>>> hour and then making it available, I would use MapReduce for this:
>>>>>>
>>>>>> http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce
>>>>>>
>>>>>> That way all the shards in a table get a little more data each hour
>>>>>> and it's very low impact on the running cluster.
>>>>>
>>>>> Not sure I understand this. I would like data to be accessible live
>>>>> as it comes in, not wait an hour before I can query against it. I am
>>>>> also not sure where map-reduce comes in here. I thought MapReduce is
>>>>> something that Blur used internally.
>>>>>
>>>>>>> When taking in lots of data constantly, how is it recommended that
>>>>>>> it be stored? I mentioned above that I create a new index for each
>>>>>>> hour to keep data separated and quicker to search. If I want to
>>>>>>> look up a specific time frame, I only have to load the directories
>>>>>>> timestamped with the hours I want to look at. So instead of having
>>>>>>> to look at a huge index of like a year's worth of data, I'm looking
>>>>>>> at a much smaller data set, which results in faster query response
>>>>>>> times. Should a new table be created for each hour of data? When I
>>>>>>> typed the create command into the shell, it took about 6 seconds to
>>>>>>> create a table. If I have to create a table for each application
>>>>>>> each hour, this could create a lot of lag. Perhaps this is just in
>>>>>>> my test environment though. Any thoughts on this? I also didn't see
>>>>>>> any examples of how to create tables via code.
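On the create-tables-via-code question, a minimal sketch against the 0.2.0
Thrift API, following the shapes shown in the using-blur docs; the table
name, shard count, and URI here are illustrative only:

```java
import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;
import org.apache.blur.thrift.generated.TableDescriptor;

public class CreateTable {
  public static void main(String[] args) throws Exception {
    // Table creation goes through a controller, same as queries.
    Iface client = BlurClient.getClient("controller1:40010");

    TableDescriptor td = new TableDescriptor();
    td.setName("logs_2013_09");  // e.g. one table per month rather than per hour
    td.setShardCount(16);        // sized to the cluster, not per site
    td.setTableUri("hdfs://namenode:8020/blur/tables/logs_2013_09");

    client.createTable(td);
  }
}
```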
>>>>>> First off, Blur is designed to store very large amounts of data,
>>>>>> and while it can do NRT updates like Solr and ES, its main focus is
>>>>>> on bulk ingestion through MapReduce. Given that, the real limiting
>>>>>> factor is how much hardware you have. Let's play out a scenario: if
>>>>>> you are adding 10GB of data an hour, a good rough ballpark guess is
>>>>>> that you will need 10-15% of the inbound data size as memory to make
>>>>>> the search perform well, though as the index sizes increase this %
>>>>>> may decrease over time. Blur has an off-heap LRU cache to make
>>>>>> accessing HDFS faster; however, if you don't have enough memory the
>>>>>> searches (and the cluster for that matter) won't fail, they will
>>>>>> simply become slower.
>>>>>>
>>>>>> So it's really a question of how much hardware you have, and whether
>>>>>> you can keep filling a table and have it perform well given the
>>>>>> cluster you have. You might have to break it into pieces, but I
>>>>>> think that hourly is too small. Daily, weekly, monthly, etc.
>>>>>
>>>>> In my current system (which uses just Lucene), we take in mainly web
>>>>> logs and separate them into indexes. Each web server gets its own
>>>>> index for each hour. Then when I need to query the data, I use a
>>>>> multi-index reader to access the timeframe I need, allowing me to
>>>>> keep the size of the index down to roughly what I need to search. If
>>>>> data was stored over a month and I want to query data that happened
>>>>> in just a single hour, or a few minutes, it makes sense to me to keep
>>>>> things optimized. Also, if I wanted to compare one web server to
>>>>> another, I would just use the multi-index reader to load both
>>>>> indexes. This is all handled by a single server though, so it is
>>>>> limited by the hardware of the single server. If something fails,
>>>>> it's a big problem. And when trying to query large data sets, it's
>>>>> again only a single server, so it takes longer than I would like if
>>>>> the index it's reading is large.
>>>>>
>>>>> I am not entirely sure how to go about doing this in Blur. I'm
>>>>> imagining that each "table" is an index. So I would have a table
>>>>> format like YYYY_MM_DD_HH_IP. If I do this though, is there a way to
>>>>> query multiple tables... like a multi-table reader or something? Or
>>>>> am I limited to looking at a single table at a time?
>>>>>
>>>>> For some web servers that have little traffic, an hour of data may
>>>>> only have a few MB of data in it, while others may have like a 5-10GB
>>>>> index. If I combined the index from a large site with the small
>>>>> sites, this should make everything slower for the queries against the
>>>>> small sites' indexes, correct? Or would it all be the same due to how
>>>>> Blur separates indexes into shards? Would it perhaps be better to
>>>>> have an index for each web server, and configure small sites to have
>>>>> fewer shards while larger sites have more shards?
>>>>>
>>>>> We just got a new really large powerful server to be our log server,
>>>>> but as I realize that it's a single point of failure, I want to
>>>>> change our configuration to use a clustered/distributed
>>>>> configuration. So we would start with probably a minimal
>>>>> configuration, and start adding more shard servers whenever we can
>>>>> afford it or need it.
>>>>>
>>>>>>> Do shards contain the index data while the location (HDFS) contains
>>>>>>> the documents (what Lucene referred to them as)? I read that the
>>>>>>> shard contains the index while the fs contains the data... I just
>>>>>>> wasn't quite sure what the data was, because when I work with
>>>>>>> Lucene, the index directory contains the data as a document.
>>>>>>
>>>>>> The shard is stored in HDFS, and it is a Lucene index. We store the
>>>>>> data inside the Lucene index, so it's basically Lucene all the way
>>>>>> down to HDFS.
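For context, the multi-index pattern Colton describes above looks roughly
like this in plain Lucene (4.x-era API; the paths and hour names are
illustrative):

```java
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class HourlySearch {
  public static void main(String[] args) throws Exception {
    // Open only the hourly indexes covering the timeframe of interest.
    String[] hours = { "2013-09-28-08", "2013-09-28-09" };
    IndexReader[] readers = new IndexReader[hours.length];
    for (int i = 0; i < hours.length; i++) {
      readers[i] = DirectoryReader.open(
          FSDirectory.open(new File("/indexes/web01/" + hours[i])));
    }

    // One searcher over just those hours; closing the MultiReader also
    // closes the subreaders (the default for this constructor).
    IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
    // ... run queries against the bounded timeframe ...
  }
}
```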
>>>>> Ok, so basically a controller is a service which connects to all (or
>>>>> some?) shards for a distributed query: it tells each shard to run a
>>>>> query against a certain data set; that shard then gets the data set
>>>>> either from memory or from the Hadoop cluster, processes it, and
>>>>> returns the result to the controller, which condenses the results
>>>>> from all the queried shards into a final result. Right?
>>>>>
>>>>>> Hope this helps. Let us know if you have more questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Aaron
>>>>>
>>>>>>> Thanks,
>>>>>>> Colton McInroy
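To make the scatter/gather flow described above concrete, a minimal
client-side query sketch against the 0.2.0 Thrift API (the table name and
query string are illustrative); the controller fans the query out to the
shard servers and condenses their results:

```java
import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;
import org.apache.blur.thrift.generated.BlurQuery;
import org.apache.blur.thrift.generated.BlurResult;
import org.apache.blur.thrift.generated.BlurResults;
import org.apache.blur.thrift.generated.SimpleQuery;

public class QueryExample {
  public static void main(String[] args) throws Exception {
    // Talk to a controller, not to individual shard servers.
    Iface client = BlurClient.getClient("controller1:40010");

    BlurQuery blurQuery = new BlurQuery();
    SimpleQuery simpleQuery = new SimpleQuery();
    simpleQuery.setQueryStr("log.site:web01"); // Lucene syntax over family.column
    blurQuery.setSimpleQuery(simpleQuery);

    // The controller runs the query on every shard of the table and
    // merges the per-shard hits into one result set.
    BlurResults results = client.query("logs_2013_09", blurQuery);
    System.out.println("total hits: " + results.getTotalResults());
    for (BlurResult result : results.getResults()) {
      System.out.println(result);
    }
  }
}
```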
