Thanks,
Colton McInroy

 * Director of Security Engineering

Phone (Toll Free)
US    (888)-818-1344 Press 2
UK    0-800-635-0551 Press 2

My Extension    101
24/7 Support    [email protected]
Email           [email protected]
Website         http://www.dosarrest.com

On 10/19/2013 6:38 AM, Aaron McCurry wrote:
Colton,

You always have great emails.  :-)
Thank you :)

I will attempt to add some value beyond Garrett's email.
Thanks, he did a fairly good job at answering a chunk of them.


Ok, so Lucene's limits are a little different than what you have described
(I think).  Based on https://issues.apache.org/jira/browse/LUCENE-2257, the
limit is not per index but instead per segment within the index.  Plus, by
default Lucene will not merge segments that are over 5GB in size, so that
provides a kind of ceiling on segment size.  Overall, the biggest limit you
will likely see is the 2 billion document (records, in our case) limit.  I
think a Lucene index could exceed that, but the counts are based on
integers.  Given that, I would think you might want to plan on a top end of
500 million per index (I have run 400+ million in a single index in the past
but keep things around 100 million now) so that you have plenty of room
before you get into the unknown limits of a single Lucene index.
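To make that a little more concrete, here is a rough sketch of where those
two ceilings show up in the Lucene 4.x API that Blur builds on (the version
constant and analyzer below are only placeholders for illustration, not what
Blur actually configures):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class LuceneCeilings {
    public static void main(String[] args) {
        // Document counts are ints, so Integer.MAX_VALUE (~2.1 billion)
        // is the hard ceiling mentioned above.
        System.out.println("Hard doc-count ceiling: " + Integer.MAX_VALUE);

        // TieredMergePolicy will not produce merged segments larger than
        // 5GB by default; setting it explicitly here just to show the knob.
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(5 * 1024);

        IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
        conf.setMergePolicy(mergePolicy);
        // conf would then be handed to an IndexWriter as usual.
    }
}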

At this point I think the biggest question mark in the system, from what you
have described, is how many shard servers you may need to have.  I have
tested Blur up to 200 shard servers in a single cluster and the controllers
seem to do just fine.  But at some point there will likely need to be some
work done to expand to 1000's of shard servers within a single cluster.
  Where that point is, I'm not sure, but I think there may need to be a
second layer of controllers to reduce network hotspots.  But that's just
a guess.
Hmmm... Ok, so to figure out the max, I basically just need to change this part of the calculation...

(Integer.MAX_VALUE * luceneTermIndexInterval * shardCount)

to

(Integer.MAX_VALUE * luceneTermIndexInterval * segmentsWithinIndex * shardCount)
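
Or, as a quick back-of-envelope check in Java (the interval, segment, and
shard values below are made-up placeholders, not numbers pulled from a real
config):

public class CapacityEstimate {
    public static void main(String[] args) {
        long maxDocsPerSegment = Integer.MAX_VALUE; // doc counts are ints
        long luceneTermIndexInterval = 128;         // hypothetical placeholder
        long segmentsWithinIndex = 10;              // hypothetical placeholder
        long shardCount = 1000;                     // Lucene indexes in the table

        // The revised formula above, done in longs to avoid int overflow.
        long theoreticalMax = maxDocsPerSegment * luceneTermIndexInterval
                * segmentsWithinIndex * shardCount;

        System.out.println("Theoretical ceiling: " + theoreticalMax);
    }
}

Although, going by your 500-million-per-index suggestion, the practical
planning number is really just 500 million records times the shard count.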


At first, I will be starting off with probably the bare minimum hardware we can, and then expanding from there. Typically we go with Supermicro 1U Xeon servers, though in this situation I understand Hadoop/Blur are meant to work on commodity hardware. What effect would having good/large servers in the same cluster as slower, smaller servers have? Would it still add to the performance, or would it cause delays for any requests that happen to hit a server that doesn't have the same amount of CPU power or memory?

My concern with this is that I have a constant, non-stop stream of new data coming in which has to get indexed as soon as possible so that it can be queried/displayed using various custom tools. The events coming in could be 10,000 or 100,000 per second, or even higher than that, perhaps 1,000,000. I have currently done as you suggested and separated it into only a few tables, rather than one every hour or something like I originally mentioned. Now, if there is this much data coming in non-stop for years, I need to try and figure out the problems I would run into.

From what I can see, a table's shard count can only be set once, so if I wanted to grow that table, I couldn't go from, say, 10 shards to 100. Is this something that will be possible in the future? And if I have that much data coming in constantly, is using a single table OK, ideal, or would it be better to split it up? In which case, how often should I create new tables? I would also need the capability to query multiple tables at a time for it to work, or build my own post-processor to aggregate results until that feature is added.

At 10,000 entries per second sustained (an entry consists of 1 row with two records), that's just shy of 14 hours until I've hit 500 million rows in a table (about 7 hours until 500 million records). At 1,000,000 rows per second, that would take less than 10 minutes. Although 1,000,000 rows per second is fairly high, some day I expect to hit that. For now I would expect anywhere from 10,000 to 100,000 rows per second, which is roughly 1.5 to 14 hours for 500 million rows.

My data flow will not stop, and I need fast access to it... primarily the newest data, but historical data has some importance as well. I am in the DDoS mitigation business, so I take on large DDoS attacks successfully and log all the requests to analyze them, which is why this is so important. For the short-term data, I need to be able to see what is going on right now in any of the logs, then perhaps compare that to past requests around the same time of year. Ideally, using a single table for all information would obviously make coding easier for me, but given the level of data, what would you recommend? One table per month with a 100 shard count? One table per year with a 1000 shard count? Or perhaps just one table, period, with a 1000 shard count, and use that until it fills up, which hopefully isn't for a really long time... long enough that the number of shards can be changed or something?
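
For my own planning, the arithmetic above looks roughly like this in Java
(the 500-million-row ceiling and the rates are the numbers from this email,
and the table-per-month naming is just one hypothetical partitioning scheme,
not anything Blur prescribes):

import java.text.SimpleDateFormat;
import java.util.Date;

public class IngestPlanning {
    static final long MAX_ROWS_PER_TABLE = 500_000_000L; // planning ceiling from above
    static final int RECORDS_PER_ROW = 2;                // 1 row = 2 records in my schema

    public static void main(String[] args) {
        long[] rowsPerSecond = { 10_000, 100_000, 1_000_000 };
        for (long rate : rowsPerSecond) {
            double hoursToMaxRows = MAX_ROWS_PER_TABLE / (double) rate / 3600.0;
            double hoursTo500mRecords =
                    (MAX_ROWS_PER_TABLE / (double) RECORDS_PER_ROW) / rate / 3600.0;
            System.out.printf("%,d rows/s -> %.1f h to 500M rows, %.1f h to 500M records%n",
                    rate, hoursToMaxRows, hoursTo500mRecords);
        }

        // One hypothetical scheme: roll to a new table each month.
        String table = "requests_" + new SimpleDateFormat("yyyy_MM").format(new Date());
        System.out.println("Current ingest table would be: " + table);
    }
}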

If I am correct on all of this, what kind of effect does having a large
shard count (1000+) have? I notice that creating tables seems to take
forever compared to how quickly the file structure is made. I've seen the
create process take 10+ seconds, yet when I look in the background I can see
the file structure already created, and checking the disk space used while
the create command is running shows no change. I've tried both HDFS and
localfs. I am doing all of this in a set of virtual sessions right now, so
it may not be the best testing platform. I'm in the process of requisitioning
the hardware, though, to start building the cluster at our data center.
Before I start putting all of the money into it, I want to see what kind of
limitations I may be up against, if any.

The delay is likely because all 1000 of the indexes are opened for writing,
their NRT threads are started up, and there is a lot of ZooKeeper logic to
get everything into a known state.  Also, since creating a table is not
something I have optimized for speed, it's more of a "get it right" process
than a "get it done fast" one.
So far, all the tables I have made have only a 10 shard count, so it shouldn't be that. Depending on what you suggest above, I may need this process to be a bit more optimized. When dealing with big data that is constantly flowing at you, stopping for 10+ seconds to create an index can back up millions of entries that need to be flushed, which could cause all kinds of problems/delays. Especially if it takes 10+ seconds for just a 10 shard count... would that mean it takes 1000+ seconds for a 1000 shard count? I haven't tried anything other than 10 so far, really.
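
If it helps, one thing I can do before the real hardware arrives is simply
time table creation at a few shard counts and see whether it grows linearly.
Something like the rough sketch below, where createTable(int shardCount) is
a hypothetical stand-in for however the table actually gets created (shell
or Thrift client), not a real Blur call:

public class CreateTableTiming {
    // Hypothetical stand-in: wire this up to however the table actually gets
    // created (Blur shell, Thrift client, etc.); it is not a real Blur method.
    static void createTable(int shardCount) {
        throw new UnsupportedOperationException("plug in real table creation here");
    }

    public static void main(String[] args) {
        int[] shardCounts = { 10, 50, 100, 500, 1000 };
        for (int shards : shardCounts) {
            long start = System.nanoTime();
            createTable(shards);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(shards + " shards -> " + elapsedMs + " ms");
        }
    }
}

If it does scale roughly linearly, I could at least pre-create the next
table well before it's needed so the pause stays out of the ingest path.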

Aaron


It's been a long night for me so far, and it's still not over; I have
meetings to go to soon. Hopefully I haven't rambled on too much ;)

I have been trying to keep close to the git code so that I can deal with
the changes as they come, as well as help where I can, although I still have
a lot of code reading left to do on the Blur project.

--
Thanks,
Colton McInroy



