Hey Guys,

    I am just wondering what kind of limitations/drawbacks I may run into.

Currently I am adding a row (UUID) with two records (1 and 2) for every syslog event. I pool the mutates together, and so far it appears to be working, although there may be further optimizations I can do. My original plan was to make a table per application and date stamp and then query across the tables, but I was informed by Aaron that this is currently not possible, although it's something that is being looked into (any idea how long, Aaron?). Instead, as per Aaron's suggestion, I made a table with just the application name using the aforementioned row/records. My inquiry is about what kind of things I can expect, or do, to avoid future problems.
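For reference, here is roughly how I am pooling things at the moment. This is just a simplified sketch in plain Java, not the actual Blur client calls; PendingRow and MutationPool are illustrative names I made up for this email.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

class PendingRow {
    // One row per syslog event, keyed by a UUID, with two records ("1" and "2").
    final String rowId = UUID.randomUUID().toString();
    final List<String> recordIds = Arrays.asList("1", "2");
}

class MutationPool {
    private final List<PendingRow> pending = new ArrayList<PendingRow>();
    private final int batchSize;

    MutationPool(int batchSize) {
        this.batchSize = batchSize;
    }

    void add(PendingRow row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        // Convert the pooled rows into Blur row mutations and send them in one
        // batch here; the actual client call is omitted in this sketch.
        pending.clear();
    }

    public static void main(String[] args) {
        MutationPool pool = new MutationPool(100);
        pool.add(new PendingRow()); // one row (UUID) with two records per syslog event
    }
}

Anyway, my questions are things like...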
Is there a limit on the number of rows in a table?
If I am going to be putting years' worth of access logs for all the sites I am logging, it's possible that I could have billions, trillions, or more rows in a table. If I do have that many rows, will queries take an extremely long time? Is this based on my shard count (shard count per table, or number of shard servers)? If I know a table is going to be very large, should I set a really large shard count, or does that stop mattering past a certain number? It was my understanding that Lucene only supports about 274 billion unique terms per index, and from my experience so far each shard (the directory, not the shard server... it gets confusing to explain the difference between the two... any plans on renaming the shard server to something else?) appears to be an index. If that is correct, the limit would be 274 billion * shard count, which would be about 2.7 trillion for 10 shards. Is that correct?
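To make that concrete, here's a quick sketch of the arithmetic, assuming the ~274-billion figure really is Integer.MAX_VALUE times the default term index interval of 128 and that each shard is one Lucene index:

public class TermCapacity {
    public static void main(String[] args) {
        // Rough Lucene unique-term ceiling per index (i.e. per shard), as discussed above.
        long termsPerIndex = (long) Integer.MAX_VALUE * 128;   // ~274.9 billion
        int shardCount = 10;                                   // one Lucene index per shard
        long termsPerTable = termsPerIndex * shardCount;       // ~2.75 trillion for 10 shards
        System.out.println(termsPerIndex + " terms per shard, " + termsPerTable + " per table");
    }
}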

If querying across tables is coming sometime soon, should I wait and separate the tables by date stamp to keep the indexes smaller, or will using a single table with lots of shards let me plow through the data without a problem? If my understanding is correct and the shard count splits an index into that many shards, then where I previously created one index per year/month/day/hour, setting the shard count to 365*24 would give the same number of indexes I currently create on the fly for one year (actually, I do 365*24*SITECOUNT). If I set the number of shards to 1000, would that be an optimal number for, well... 1000 * 274 billion = 274 trillion entries?
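Putting rough numbers on that comparison, with the same per-index term limit assumed above:

public class ShardMath {
    public static void main(String[] args) {
        int hourlyIndexesPerYear = 365 * 24;                  // 8,760 hourly indexes per site today
        long termsPerIndex = (long) Integer.MAX_VALUE * 128;  // ~274.9 billion
        long capacityAt1000Shards = termsPerIndex * 1000;     // ~274.9 trillion terms total
        System.out.println(hourlyIndexesPerYear + " hourly indexes per site per year");
        System.out.println(capacityAt1000Shards + " terms across 1000 shards");
    }
}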
Again, if this is all correct, for that scenario, I can see the following...
274,000,000,000,000 / 365 / 24 / 60 / 60 ≈ 8,688,500 rows per second for one year until full
Unless it is per record (and I insert 2 records for each row), in which case it would be...
274,000,000,000,000 / 365 / 24 / 60 / 60 / 2 ≈ 4,344,250 records per second for one year until full
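Here is a quick check of that arithmetic with the exact constants, assuming the per-index limit really is Integer.MAX_VALUE * 128 and 1000 shards:

public class FillRate {
    public static void main(String[] args) {
        long termsPerIndex = (long) Integer.MAX_VALUE * 128; // ~274.9 billion
        long totalTerms = termsPerIndex * 1000;              // 1000 shards
        long secondsPerYear = 365L * 24 * 60 * 60;           // 31,536,000
        long rowsPerSecond = totalTerms / secondsPerYear;    // ~8.7 million
        long recordsPerSecond = rowsPerSecond / 2;           // ~4.36 million (2 records per row)
        System.out.println(rowsPerSecond + " rows/s or " + recordsPerSecond + " records/s for one year until full");
    }
}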

4 million events per second is above what I am dealing with right now. But this math should help people figure out what settings to use for a given scale... again, if this is all correct; I'll need someone like Aaron to chime in. I end up with this calculation...

int shardCount = 10;
int recordsPerRow = 2;
int recordsPerSecond = 10000;
int luceneTermIndexInterval = 128; // Default
long termsPerShard = (long) Integer.MAX_VALUE * luceneTermIndexInterval; // cast to long to avoid int overflow
double yearsUntilFull = (double) termsPerShard * shardCount / 365 / 24 / 60 / 60 / recordsPerRow / recordsPerSecond; // ≈ 4.36 years until full

I just kind of threw this out there and did it quickly out of my head to demonstrate, but I am pretty sure my math is right (haven't been drinking... too much). In that example, it's 4.36 years for 10 shards with 10,000 sustained entries per second. Add another zero to the number of shards and it's 100,000 sustained entries per second for 4.36 years until full.
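To show that "add another zero" point with the same formula and assumptions as above:

public class YearsUntilFull {
    public static void main(String[] args) {
        int luceneTermIndexInterval = 128; // default
        int recordsPerRow = 2;
        long termsPerShard = (long) Integer.MAX_VALUE * luceneTermIndexInterval;
        long secondsPerYear = 365L * 24 * 60 * 60;
        int[] shardCounts = {10, 100, 1000};
        int[] recordsPerSecond = {10000, 100000, 1000000};
        for (int i = 0; i < shardCounts.length; i++) {
            double years = (double) termsPerShard * shardCounts[i]
                    / secondsPerYear / recordsPerRow / recordsPerSecond[i];
            // Prints ~4.36 years in every case: capacity and ingest rate scale together.
            System.out.println(shardCounts[i] + " shards @ " + recordsPerSecond[i]
                    + " entries/s -> " + years + " years until full");
        }
    }
}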

If I am correct on all of this, what kind of effect does having a large shard count (1000+) have? I notice that creating tables seems to take forever compared to how quickly the file structure is made. I've seen the create process take 10+ seconds, but when I look in the background the file structure has already been created, and if I check the disk space used while the create command is running, there is no change. I've tried both HDFS and localfs. I am doing all of this in a set of virtual sessions right now, so it may not be the best testing platform. I'm in the process of requisitioning the hardware to start building the cluster at our data center, though. Before I start putting all of the money into it, I want to see what kind of limitations I may be up against, if any.

It's been a long night for me so far, and it's still not over; I have meetings to go to soon. Hopefully I haven't rambled on too much ;)

I have been trying to keep up with the git code so that I can deal with the changes as they come, as well as help where I can, although I still have a lot of code reading left to do on the Blur project.

--
Thanks,
Colton McInroy

 * Director of Security Engineering

        
Phone (Toll Free):
US: (888)-818-1344 Press 2
UK: 0-800-635-0551 Press 2
My Extension: 101
24/7 Support: [email protected]
Email: [email protected]
Website: http://www.dosarrest.com
