Thanks,
Colton McInroy

 * Director of Security Engineering

Phone (Toll Free)
US    (888)-818-1344 Press 2
UK    0-800-635-0551 Press 2

My Extension    101
24/7 Support    [email protected]
Email           [email protected]
Website         http://www.dosarrest.com

On 10/19/2013 6:38 AM, Aaron McCurry wrote:
Colton,

You always have great emails.  :-)
Thank you :)

I will attempt to add some value beyond Garrett's email.
Thanks, he did a fairly good job at answering a chunk of them.


Ok, so Lucene's limits are a little different than what you have described
(I think).  Based on https://issues.apache.org/jira/browse/LUCENE-2257, the
limit is not per index but instead per segment within the index.  Plus, by
default Lucene will not merge segments that are over 5GB in size, so that
provides a kind of ceiling on segment size.  Overall, the biggest limit you
will likely see is the 2 billion document (records, in our case) limit.  I
think a Lucene index could exceed that, but the counts are based on
integers.  Given that, I would think you might want to plan on a top end of
500 million per index (I have run 400+ million in a single index in the past
but keep things around 100 million now) so that you have plenty of room
before you get into the unknown limits of a single Lucene index.
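To make that a little more concrete, here is a rough sketch of where those
two ceilings show up in the Lucene 4.x API that Blur builds on (the version
constant and analyzer below are only placeholders for illustration, not what
Blur actually configures):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class LuceneCeilings {
    public static void main(String[] args) {
        // Document counts are ints, so Integer.MAX_VALUE (~2.1 billion)
        // is the hard ceiling mentioned above.
        System.out.println("Hard doc-count ceiling: " + Integer.MAX_VALUE);

        // TieredMergePolicy will not produce merged segments larger than
        // 5GB by default; setting it explicitly here just to show the knob.
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(5 * 1024);

        IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
        conf.setMergePolicy(mergePolicy);
        // conf would then be handed to an IndexWriter as usual.
    }
}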

At this point I think the biggest question mark in the system, from what you
have described, is how many shard servers you may need to have.  I have
tested Blur up to 200 shard servers in a single cluster and the controllers
seem to do just fine.  But at some point there will likely need to be some
work done to expand to 1000's of shard servers within a single cluster.
  Where that point is, I'm not sure, but I think there may need to be a
second layer of controllers to reduce network hotspots.  But that's just
a guess.
Hmmm... Ok, so to figure out the max, I basically just need to change this part of the calculation...

(Integer.MAX_VALUE * luceneTermIndexInterval * shardCount)

to

(Integer.MAX_VALUE * luceneTermIndexInterval * segmentsWithinIndex * shardCount)
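
Or, as a quick back-of-envelope check in Java (the interval, segment, and
shard values below are made-up placeholders, not numbers pulled from a real
config):

public class CapacityEstimate {
    public static void main(String[] args) {
        long maxDocsPerSegment = Integer.MAX_VALUE; // doc counts are ints
        long luceneTermIndexInterval = 128;         // hypothetical placeholder
        long segmentsWithinIndex = 10;              // hypothetical placeholder
        long shardCount = 1000;                     // Lucene indexes in the table

        // The revised formula above, done in longs to avoid int overflow.
        long theoreticalMax = maxDocsPerSegment * luceneTermIndexInterval
                * segmentsWithinIndex * shardCount;

        System.out.println("Theoretical ceiling: " + theoreticalMax);
    }
}

Although, going by your 500-million-per-index suggestion, the practical
planning number is really just 500 million records times the shard count.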


At first, I will be starting off with probably the bare minimum hardware we can, and then expanding from there. Typically we go with Supermicro 1U Xeon servers, though in this situation I understand Hadoop/Blur are meant to work on commodity hardware. What effect would having good/large servers in the same cluster as slower, smaller servers have? Would it still add to the performance, or would it cause delays for any requests that happen to hit a server that doesn't have the same amount of CPU power or memory?

My concern with this is that I have a constant, non-stop stream of new data coming in which has to get indexed as soon as possible so that it can be queried/displayed using various custom tools. The events coming in could be 10,000 or 100,000 per second, or even higher than that, perhaps 1,000,000. I have currently done as you suggested and separated it into only a few tables, rather than one every hour or something like I originally mentioned. Now, if there is this much data coming in non-stop for years, I need to try and figure out the problems I would run into.

From what I can see, a table's shard count can only be set once, so if I wanted to grow that table, I couldn't go from, say, 10 shards to 100. Is this something that will be possible in the future? And if I have that much data coming in constantly, is using a single table OK, ideal, or would it be better to split it up? In which case, how often should I create new tables? I would also need the capability to query multiple tables at a time for it to work, or build my own post-processor to aggregate results until that feature is added.

At 10,000 entries per second sustained (an entry consists of 1 row with two records), that's just shy of 14 hours until I've hit 500 million rows in a table (about 7 hours until 500 million records). At 1,000,000 rows per second, that would take less than 10 minutes. Although 1,000,000 rows per second is fairly high, some day I expect to hit that. For now I would expect anywhere from 10,000 to 100,000 rows per second, which is roughly 1.5 to 14 hours for 500 million rows.

My data flow will not stop, and I need fast access to it... primarily the newest data, but historical data has some importance as well. I am in the DDoS mitigation business, so I take on large DDoS attacks successfully and log all the requests to analyze them, which is why this is so important. For the short-term data, I need to be able to see what is going on right now in any of the logs, then perhaps compare that to past requests around the same time of year. Ideally, using a single table for all information would obviously make coding easier for me, but given the level of data, what would you recommend? One table per month with a 100 shard count? One table per year with a 1000 shard count? Or perhaps just one table, period, with a 1000 shard count, and use that until it fills up, which hopefully isn't for a really long time... long enough that the number of shards can be changed or something?
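
For my own planning, the arithmetic above looks roughly like this in Java
(the 500-million-row ceiling and the rates are the numbers from this email,
and the table-per-month naming is just one hypothetical partitioning scheme,
not anything Blur prescribes):

import java.text.SimpleDateFormat;
import java.util.Date;

public class IngestPlanning {
    static final long MAX_ROWS_PER_TABLE = 500_000_000L; // planning ceiling from above
    static final int RECORDS_PER_ROW = 2;                // 1 row = 2 records in my schema

    public static void main(String[] args) {
        long[] rowsPerSecond = { 10_000, 100_000, 1_000_000 };
        for (long rate : rowsPerSecond) {
            double hoursToMaxRows = MAX_ROWS_PER_TABLE / (double) rate / 3600.0;
            double hoursTo500mRecords =
                    (MAX_ROWS_PER_TABLE / (double) RECORDS_PER_ROW) / rate / 3600.0;
            System.out.printf("%,d rows/s -> %.1f h to 500M rows, %.1f h to 500M records%n",
                    rate, hoursToMaxRows, hoursTo500mRecords);
        }

        // One hypothetical scheme: roll to a new table each month.
        String table = "requests_" + new SimpleDateFormat("yyyy_MM").format(new Date());
        System.out.println("Current ingest table would be: " + table);
    }
}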

If I am correct on all of this, what kind of effect does having a large
shard count (1000+) have? I notice that creating tables seems to take
forever compared to how quickly the file structure is made. I've seen the
create process take 10+ seconds, yet when I look in the background I can see
the file structure already created, and checking the disk space used while
the create command is running shows no change. I've tried both HDFS and
localfs. I am doing all of this in a set of virtual sessions right now, so
it may not be the best testing platform. I'm in the process of requisitioning
the hardware, though, to start building the cluster at our data center.
Before I start putting all of the money into it, I want to see what kind of
limitations I may be up against, if any.

The delay is likely because all 1000 of the indexes are opened for writing,
their NRT threads are started up, and there is a lot of ZooKeeper logic to
get everything into a known state.  Also, since creating a table is not
something I have optimized for speed, it's more of a "get it right" process
than a "get it done fast" one.
So far, all the tables I have made have only a 10 shard count, so it shouldn't be that. Depending on what you suggest above, I may need this process to be a bit more optimized. When dealing with big data that is constantly flowing at you, stopping for 10+ seconds to create an index can back up millions of entries that need to be flushed, which could cause all kinds of problems/delays. Especially if it takes 10+ seconds for just a 10 shard count... would that mean it takes 1000+ seconds for a 1000 shard count? I haven't tried anything other than 10 so far, really.
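
If it helps, one thing I can do before the real hardware arrives is simply
time table creation at a few shard counts and see whether it grows linearly.
Something like the rough sketch below, where createTable(int shardCount) is
a hypothetical stand-in for however the table actually gets created (shell
or Thrift client), not a real Blur call:

public class CreateTableTiming {
    // Hypothetical stand-in: wire this up to however the table actually gets
    // created (Blur shell, Thrift client, etc.); it is not a real Blur method.
    static void createTable(int shardCount) {
        throw new UnsupportedOperationException("plug in real table creation here");
    }

    public static void main(String[] args) {
        int[] shardCounts = { 10, 50, 100, 500, 1000 };
        for (int shards : shardCounts) {
            long start = System.nanoTime();
            createTable(shards);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(shards + " shards -> " + elapsedMs + " ms");
        }
    }
}

If it does scale roughly linearly, I could at least pre-create the next
table well before it's needed so the pause stays out of the ingest path.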

Aaron


It's been a long night for me so far, and it's still not over; I have
meetings to go to soon. Hopefully I haven't rambled on too much ;)

I have been trying to keep close to the git code so that I can deal with
the changes as they come, as well as help where I can, although I still have
a lot of code reading left to do on the Blur project.

--
Thanks,
Colton McInroy



