Hey Guys,

    I am just wondering what kind of limitations/drawbacks I may run into.

Currently I am adding a row (UUID) with two records (1 and 2) for every syslog event. I pool the mutates together, and so far it appears to be working, although there may be further optimizations I can do. My original plan was to make a table per application and date stamp and then query across the tables, but I was informed by Aaron that this is currently not possible, although it's something that is being looked into (any idea how long, Aaron?). Instead, as per Aaron's suggestion, I made a table with just the application name using the aforementioned row/records. My inquiry is about what kind of things I can expect, or do, to avoid future problems.
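For reference, here is roughly how I am pooling things at the moment. This is just a simplified sketch in plain Java, not the actual Blur client calls; PendingRow and MutationPool are illustrative names I made up for this email.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

class PendingRow {
    // One row per syslog event, keyed by a UUID, with two records ("1" and "2").
    final String rowId = UUID.randomUUID().toString();
    final List<String> recordIds = Arrays.asList("1", "2");
}

class MutationPool {
    private final List<PendingRow> pending = new ArrayList<PendingRow>();
    private final int batchSize;

    MutationPool(int batchSize) {
        this.batchSize = batchSize;
    }

    void add(PendingRow row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        // Convert the pooled rows into Blur row mutations and send them in one
        // batch here; the actual client call is omitted in this sketch.
        pending.clear();
    }

    public static void main(String[] args) {
        MutationPool pool = new MutationPool(100);
        pool.add(new PendingRow()); // one row (UUID) with two records per syslog event
    }
}

Anyway, my questions are things like...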
Is there a limit on the number of rows in a table?
If I am going to be putting years' worth of access logs for all the sites I am logging, it's possible that I could have billions, trillions, or more rows in a table. If I do have that many rows, will queries take an extremely long time? Is this based on my shard count (shard count per table, or number of shard servers)? If I know a table is going to be very large, should I set a really large shard count, or does that stop mattering past a certain number? It was my understanding that Lucene only supports about 274 billion unique terms per index, and from my experience so far each shard (the directory, not the shard server... it gets confusing to explain the difference between the two... any plans on renaming the shard server to something else?) appears to be an index. If that is correct, the limit would be 274 billion * shard count, which would be about 2.7 trillion for 10 shards. Is that correct?
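To make that concrete, here's a quick sketch of the arithmetic, assuming the ~274-billion figure really is Integer.MAX_VALUE times the default term index interval of 128 and that each shard is one Lucene index:

public class TermCapacity {
    public static void main(String[] args) {
        // Rough Lucene unique-term ceiling per index (i.e. per shard), as discussed above.
        long termsPerIndex = (long) Integer.MAX_VALUE * 128;   // ~274.9 billion
        int shardCount = 10;                                   // one Lucene index per shard
        long termsPerTable = termsPerIndex * shardCount;       // ~2.75 trillion for 10 shards
        System.out.println(termsPerIndex + " terms per shard, " + termsPerTable + " per table");
    }
}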

If querying across tables is coming sometime soon, should I wait and separate the tables by date stamp to keep the indexes smaller, or will using a single table with lots of shards let me plow through the data without a problem? If my understanding is correct and the shard count splits an index into that many shards, then where I previously created one index per year/month/day/hour, setting the shard count to 365*24 would give the same number of indexes I currently create on the fly for one year (actually, I do 365*24*SITECOUNT). If I set the number of shards to 1000, would that be an optimal number for, well... 1000 * 274 billion = 274 trillion entries?
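Putting rough numbers on that comparison, with the same per-index term limit assumed above:

public class ShardMath {
    public static void main(String[] args) {
        int hourlyIndexesPerYear = 365 * 24;                  // 8,760 hourly indexes per site today
        long termsPerIndex = (long) Integer.MAX_VALUE * 128;  // ~274.9 billion
        long capacityAt1000Shards = termsPerIndex * 1000;     // ~274.9 trillion terms total
        System.out.println(hourlyIndexesPerYear + " hourly indexes per site per year");
        System.out.println(capacityAt1000Shards + " terms across 1000 shards");
    }
}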
Again, if this is all correct, for that scenario, I can see the following...
274,000,000,000,000 / 365 / 24 / 60 / 60 ≈ 8,688,500 rows per second for one year until full
Unless it is per record (and I insert 2 records for each row), in which case it would be...
274,000,000,000,000 / 365 / 24 / 60 / 60 / 2 ≈ 4,344,250 records per second for one year until full
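Here is a quick check of that arithmetic with the exact constants, assuming the per-index limit really is Integer.MAX_VALUE * 128 and 1000 shards:

public class FillRate {
    public static void main(String[] args) {
        long termsPerIndex = (long) Integer.MAX_VALUE * 128; // ~274.9 billion
        long totalTerms = termsPerIndex * 1000;              // 1000 shards
        long secondsPerYear = 365L * 24 * 60 * 60;           // 31,536,000
        long rowsPerSecond = totalTerms / secondsPerYear;    // ~8.7 million
        long recordsPerSecond = rowsPerSecond / 2;           // ~4.36 million (2 records per row)
        System.out.println(rowsPerSecond + " rows/s or " + recordsPerSecond + " records/s for one year until full");
    }
}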

4 million events per second is above what I am dealing with right now. But this math should help people figure out what settings to use for a given scale... again, if this is all correct; I'll need someone like Aaron to chime in. I end up with this calculation...

int shardCount = 10;
int recordsPerRow = 2;
int recordsPerSecond = 10000;
int luceneTermIndexInterval = 128; // Default
long termsPerShard = (long) Integer.MAX_VALUE * luceneTermIndexInterval; // cast to long to avoid int overflow
double yearsUntilFull = (double) termsPerShard * shardCount / 365 / 24 / 60 / 60 / recordsPerRow / recordsPerSecond; // ≈ 4.36 years until full

I just kind of threw this out there and did it quickly out of my head to demonstrate, but I am pretty sure my math is right (haven't been drinking... too much). In that example, it's 4.36 years for 10 shards with 10,000 sustained entries per second. Add another zero to the number of shards and it's 100,000 sustained entries per second for 4.36 years until full.
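To show that "add another zero" point with the same formula and assumptions as above:

public class YearsUntilFull {
    public static void main(String[] args) {
        int luceneTermIndexInterval = 128; // default
        int recordsPerRow = 2;
        long termsPerShard = (long) Integer.MAX_VALUE * luceneTermIndexInterval;
        long secondsPerYear = 365L * 24 * 60 * 60;
        int[] shardCounts = {10, 100, 1000};
        int[] recordsPerSecond = {10000, 100000, 1000000};
        for (int i = 0; i < shardCounts.length; i++) {
            double years = (double) termsPerShard * shardCounts[i]
                    / secondsPerYear / recordsPerRow / recordsPerSecond[i];
            // Prints ~4.36 years in every case: capacity and ingest rate scale together.
            System.out.println(shardCounts[i] + " shards @ " + recordsPerSecond[i]
                    + " entries/s -> " + years + " years until full");
        }
    }
}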

If I am correct on all of this, what kind of effect does having a large shard count (1000+) have? I notice that creating tables seems to take forever compared to how quickly the file structure is made. I've seen the create process take 10+ seconds, but when I look in the background the file structure has already been created, and if I check the disk space used while the create command is running, there is no change. I've tried both HDFS and localfs. I am doing all of this in a set of virtual sessions right now, so it may not be the best testing platform. I'm in the process of requisitioning the hardware to start building the cluster at our data center, though. Before I start putting all of the money into it, I want to see what kind of limitations I may be up against, if any.

It's been a long night for me so far, and it's still not over; I have meetings to go to soon. Hopefully I haven't rambled on too much ;)

I have been trying to keep up with the git code so that I can deal with the changes as they come, as well as help where I can, although I still have a lot of code reading left to do on the Blur project.

--
Thanks,
Colton McInroy

 * Director of Security Engineering

        
Phone (Toll Free):
US: (888)-818-1344 Press 2
UK: 0-800-635-0551 Press 2
My Extension: 101
24/7 Support: [email protected]
Email: [email protected]
Website: http://www.dosarrest.com
