On 10/19/2013 6:38 AM, Aaron McCurry wrote:
Colton,
You always have great emails. :-)
Thank you :)
I will attempt to add some value beyond Garrett's email.
Thanks, he did a fairly good job at answering a chunk of them.
Ok, so Lucene's limits are a little different than what you have described
(I think). Based on https://issues.apache.org/jira/browse/LUCENE-2257 the
limit is not per index but instead per segment within the index. Plus, by
default Lucene will not merge segments that are over 5GB in size, so that
provides a kind of ceiling for the size of a segment. Overall, the biggest
limit you will likely see is the 2 billion document (records, in our case)
limit. I think a Lucene index could exceed that, but the counts are based
on integers. Given that, I would think that you might want to calculate a
top end of 500 million per index (I have run 400+ million in a single index
in the past but keep things around 100 million now) so that you have plenty
of room before you get into the unknown limits of a single Lucene index.
At this point I think the biggest question mark in the system, from what you
have described, is how many shard servers you may need to have. I have
tested Blur up to 200 shard servers in a single cluster and the controllers
seem to do just fine. But at some point there will likely need to be some
work done to expand to 1000's of shard servers within a single cluster.
Where that point is, I'm not sure, but I think there may need to be a
second layer of controllers to reduce network hotspots. But that's just
a guess.
Hmmm... Ok, so to figure out the max, I basically just need to change
this part of the calculation...
(Integer.MAX_VALUE * luceneTermIndexInterval * shardCount)
to
(Integer.MAX_VALUE * luceneTermIndexInterval * segmentsWithinIndex * shardCount)
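Just as a sanity check on my side, here's the back-of-the-envelope version of
that calculation, computed exactly as written above (the interval, segment,
and shard values are placeholders I picked for illustration, not measured
numbers):

public class CapacityEstimate {
    public static void main(String[] args) {
        // Compute the formula above as written. All three multipliers are
        // placeholder values used only for illustration.
        long maxDocsPerSegment = Integer.MAX_VALUE; // 2,147,483,647
        long luceneTermIndexInterval = 128;         // assumed value, not verified
        long segmentsWithinIndex = 10;              // placeholder
        long shardCount = 1000;                     // placeholder

        long theoreticalMax = maxDocsPerSegment * luceneTermIndexInterval
                * segmentsWithinIndex * shardCount; // ~2.75e15, fits in a long
        System.out.println("Theoretical max records: " + theoreticalMax);
    }
}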
At first, I will probably be starting off with the bare minimum hardware
we can, and then expand from there. Typically we go with Supermicro 1U
Xeon servers, though in this situation I understand Hadoop/Blur are
meant to work on commodity hardware. What effect would having good/large
servers in the same cluster as slower, smaller servers have? Would they
still add to the performance, or would they cause delays for any requests
that happen to hit one of the smaller servers because it doesn't have the
same amount of CPU power or memory?
My concern with this is that I have a constant, non-stop stream of new
data coming in which has to get indexed as soon as possible so that it
can be queried/displayed using various custom tools. The events per
second coming in could be 10,000, 100,000, or even higher than that,
perhaps 1,000,000. I have done as you suggested and separated it into
only a few tables, rather than one every hour or something like I
originally mentioned.
Now if there is this much data coming in non-stop for years, I need to
try and figure out the problems I would run into. From what I can see, a
table's shard count can only be set once, so if I wanted to grow a
table, I couldn't go from, say, 10 shards to 100. Is this something that
will be possible in the future? And if I have that much data coming in
constantly, is using a single table ok, or even ideal, or would it be
better to split it up? In that case, how often should I create new
tables? I would also need the capability to query multiple tables at a
time for it to work, or build that feature myself as a post-processor
that aggregates results until the feature is added. At 10,000 entries (an
entry consists of 1 row with two records) per second sustained, that's
just shy of 14 hours until I've hit 500 million rows in a table (about 7
hours until 500 million records). At 1,000,000 rows per second, that
would take less than 10 minutes. Although 1,000,000 rows per second is
fairly high, some day I expect to hit it. For now I would expect anywhere
from 10,000 to 100,000 rows per second, which is roughly 1.5 to 14 hours
for 500 million rows.
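For reference, the fill-time arithmetic I'm using above (the rates are just
the example figures, and one entry = one row = two records):

public class FillTime {
    public static void main(String[] args) {
        long targetRows = 500000000L; // 500 million rows
        long[] rowsPerSecond = {10000L, 100000L, 1000000L};
        for (long rate : rowsPerSecond) {
            double hoursForRows = targetRows / (double) rate / 3600.0;
            // Records arrive twice as fast as rows (two records per row), so
            // the 500-million-record mark is reached in half the time.
            System.out.printf("%,d rows/sec -> %.1f hours to 500M rows, %.1f hours to 500M records%n",
                    rate, hoursForRows, hoursForRows / 2.0);
        }
    }
}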
My data flow will not stop, and I need fast access to it... primarily
the newest data, but historical data has some importance as well. I am
in the DDoS mitigation business, so I take on large DDoS attacks
successfully and log all the requests to analyze them, which is why this
is so important. For the short-term data, I need to be able to see what
is going on right now in any of the logs, then perhaps compare that
to past requests from around that time of year or something.
Ideally, using a single table for all information would obviously make
the coding easier for me, but given the volume of data, what would you
recommend? One table per month with a 100 shard count? One table per
year with a 1000 shard count? Or perhaps just one table, period, with a
1000 shard count, and use that until it fills up, which hopefully isn't
for a really long time... long enough that the number of shards can be
changed or something?
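To make the one-table-per-month option concrete, this is roughly the helper
I'd expect to write on my side (the requests_YYYY_MM naming scheme is just
something I made up; the cross-table merge would still be my own
post-processor):

import java.util.ArrayList;
import java.util.List;

// Sketch of a per-month partitioning scheme. Given a query window, list every
// monthly table the query would have to fan out to; a custom post-processor
// would merge the per-table results until multi-table queries are built in.
public class MonthlyTables {
    public static List<String> tablesForRange(int fromYear, int fromMonth,
                                              int toYear, int toMonth) {
        List<String> tables = new ArrayList<String>();
        int year = fromYear;
        int month = fromMonth;
        while (year < toYear || (year == toYear && month <= toMonth)) {
            tables.add(String.format("requests_%04d_%02d", year, month));
            month++;
            if (month > 12) { month = 1; year++; }
        }
        return tables;
    }

    public static void main(String[] args) {
        // A query spanning Aug 2013 through Oct 2013 would touch three tables.
        System.out.println(tablesForRange(2013, 8, 2013, 10));
    }
}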
If I am correct on all of this, what kind of effect does having a large
shard count (1000+) have? I notice that creating a table seems to take
forever compared to how quickly the file structure is made. I've seen the
create process take 10+ seconds, but when I look in the background the
file structure is already created, and checking the disk space used while
the create command is running shows no change. I've tried this both on
HDFS and localfs. I am doing all of this in a set of virtual sessions
right now, so it may not be the best testing platform. I'm in the process
of requisitioning the hardware, though, to start building the cluster at
our data center. Before I start putting all of the money into it, I want
to see what kind of limitations I may be up against, if any.
The delay is likely because all 1000 of the indexes are opened for
writing, their NRT threads are started up, and there is a lot of
ZooKeeper logic to get everything into a known state. Also, since
creating a table is not something I have optimized for speed, it's more
of a get-it-right process instead of a get-it-done-fast one.
So far, all the tables I have made have only a 10 shard count, so it
shouldn't be that. Depending upon what you suggest above, I may need
this process to be a bit more optimized. When dealing with big data that
is constantly flowing at you, stopping for 10+ seconds to create an
index can back up millions of entries that need to be flushed, which
could cause all kinds of problems/delays. Especially if it takes about
10+ seconds for just a 10 shard count... would that mean it takes 1000+
seconds for a 1000 shard count? I haven't really tried anything other
than 10 so far.
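If the create time really does scale with shard count, the workaround I'd
probably reach for is pre-creating the next table on a background thread
well before it's needed, so the 10+ second create never sits in the ingest
path. A rough sketch (createTable() below is a hypothetical placeholder,
not a real client API):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: kick off table creation on a background thread ahead of time so
// ingestion never blocks on it. createTable() is a hypothetical placeholder
// for whatever client call actually creates the table.
public class TablePreCreator {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    public Future<?> preCreate(final String tableName, final int shardCount) {
        return pool.submit(new Runnable() {
            public void run() {
                createTable(tableName, shardCount); // hypothetical placeholder
            }
        });
    }

    private void createTable(String tableName, int shardCount) {
        // placeholder: the real table-creation call would go here
    }

    public void shutdown() {
        pool.shutdown();
    }
}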
Aaron
It's been a long night for me so far, and it's still not over; I've got
meetings to go to soon. Hopefully I haven't rambled on too much ;)
I have been trying to keep close to the git code so that I can deal with
the changes as they come, as well as help where I can, although I still
have a lot of code reading left to do on the Blur project.
--
Thanks,
Colton McInroy
* Director of Security Engineering
Phone
(Toll Free)
_US_ (888)-818-1344 Press 2
_UK_ 0-800-635-0551 Press 2
My Extension 101
24/7 Support [email protected] <mailto:[email protected]>
Email [email protected] <mailto:[email protected]>
Website http://www.dosarrest.com