This won't make sense unless you have some background with distributed & redundant monitoring setups, but...
Years ago I was working for a company where I wrote an alerting system that used a URL-shortening trick so alerts could be sent over SMS. The server needed to be simple and resilient, with as few entry/attack points as possible, because this was for a major infrastructure system monitoring several thousand machines for dozens of businesses. Obviously the whole system was a prime target for attack. It would have been a disaster to have a SQL injection or some other nonsense go down, so we ruled out a typical LAMP stack right away.

The solution was to have Nagios write the (shortened) alert acknowledgment URL to /tmp as a meta-redirect to the actual acknowledgment page, which could reside on any one of a dozen or so servers. We then used a purpose-built web server (it only responded to GET, would only allow a request length of x characters, etc.) to serve the page as static HTML, letting the browser handle the redirect to the actual acknowledgment page. Since all alerts expire after a set time, and because this was just the acknowledgment URL, not the alert itself, having them disappear after a reboot was sort of a non-issue. We had tens of thousands of shortened URLs stored in /tmp with no problems. Obviously it wasn't millions, and they held actual data instead of being the result of touch, but the concept is the same.

I've mentioned before that I have major issues with the box I'm on, particularly in the filesystem department. I'm seriously of a mind that the HD may be failing. I didn't realize that there was any mirroring of /tmp to the HD, and frankly it worries me that this occurred. I'm tempted to spin up a few different EC2 instances with various filesystems and repeat the experiment just to see if they all have the same issue. Obviously it's not what a filesystem is intended for, but one would tend to think that 24 million files would be a snap for any modern filesystem to handle.

I guess I look at a filesystem differently. I view any filesystem as nothing more than a database of connected blocks which are joined to represent files. I started looking at them this way when I realized that rm (or, in the case of DOS, del) doesn't remove the file's data itself; it merely removes the entry from the file allocation table/tree/whatever. Therefore it made sense in my mind to take advantage of what ought to be one of the fastest databases around, i.e. the filesystem, to solve my problems. Looks like I was wrong, and I'm exploring the options.

Thanks for the info about the cuckoo hashes.
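In case it helps picture what I described above, the shortened "URLs" were nothing more than static files dropped in /tmp, roughly along these lines. This is a from-memory sketch, not the original code, and the token, paths, and acknowledgment URL are made up for illustration:

/* Rough reconstruction of the idea, not the original code: for each
 * alert, Nagios ran something like this to drop a tiny static page in
 * /tmp named after the short token.  The token and URLs below are made
 * up for illustration. */
#include <stdio.h>

/* Write /tmp/<token> as a meta-refresh pointing at the real
 * acknowledgment page. */
static int write_short_url(const char *token, const char *ack_url)
{
    char path[256];
    snprintf(path, sizeof path, "/tmp/%s", token);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    fprintf(f,
        "<html><head>"
        "<meta http-equiv=\"refresh\" content=\"0; url=%s\">"
        "</head><body>"
        "<a href=\"%s\">Acknowledge alert</a>"
        "</body></html>\n",
        ack_url, ack_url);

    return fclose(f);
}

int main(void)
{
    /* e.g. the short link for this alert might be http://short.example/x7Qf2 */
    return write_short_url("x7Qf2",
        "https://ack.example.com/cgi-bin/ack?alert=12345") ? 1 : 0;
}

The stripped-down web server did nothing but map GET /<token> to /tmp/<token> and return the file verbatim; the browser handled the redirect from there.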
On Sat, Dec 28, 2013 at 4:18 AM, Levi Pearson <[email protected]> wrote:
> On Sat, Dec 28, 2013 at 1:59 AM, S. Dale Morrey <[email protected]>
> wrote:
> > Yep the experiment itself is a bad idea.
> > Still if you had 25 million distinct entries each 32 bytes long and you
> > needed the fastest lookup and comparison time possible, what would you
> > use?
>
> Well, that would depend on how much memory I had to spare. If you
> load them into an array, they take up ~763MB, which is not terrible.
> You can sort them in place and perform binary searches, which would
> take ~25 comparisons (each composed of 1-4 64-bit compares) per
> search, which is also not terrible. After sorting, any collisions
> would be adjacent, so you could identify them with a single walk
> through the array. Is this the fastest possible? Possibly not, but
> it's sure simple to implement in C and if it's fast enough, it's fast
> enough. For extra speed, you could multi-thread the binary search,
> which is trivially parallelizable.
>
> The danger of using more advanced data structures is that the
> bookkeeping starts to take a significant amount of memory, and then
> (depending on your available memory) you could lose the significant
> advantage of having all your data in RAM. 25 million elements is not
> exactly a large number for a modern processor that operates at the
> scale of billions of operations per second per core, and the hit you
> take from paging could be significant under memory pressure.
>
> A typical hash table in a scripting language is not going to be very
> space-efficient or as computationally efficient as possible. It might
> be interesting to try a cuckoo hash, where one hash function for a
> table of 2^n buckets simply takes the high-order n bits of the value
> (it's already a hash, remember) and the other takes the low-order n
> bits. Cuckoo hashes can be more space-efficient, as they can store
> values directly in the buckets and sustain a load factor >80%. With
> n=25, you'd have a load factor of 74.5% with 1G of memory used and two
> extremely fast hash functions (assuming they work adequately for this
> task, which I haven't studied in depth).
>
> I really think the sorted array and binary search would be sufficient
> for most purposes, though, and far less work to get running and
> debugged.
>
> > FYI putting 24 million distinct entries into a single directory even /tmp
> > completely corrupted my file system to a point it wouldn't even boot. I'm
> > trying to figure out how that's possible. My understanding was that /tmp
> > isn't supposed to persist after a reboot, but everything was there anyways
> > (near as I could tell). Took an hour to rm -rf that directory from a
> > recovery disk.
>
> That's a bit scary, though it's certainly not a typical use case for a
> file system. I would have recommended against that one, and I hope a
> bit more thinking before trying it out would have led you to avoid it
> as well.
>
> --Levi
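For what it's worth, here's roughly what I have in mind for the sorted-array approach you describe. Untested sketch; how the 25 million values actually get loaded is left as a commented-out placeholder (load_keys), and the probe value is just a stand-in:

/* Untested sketch of the sorted-array idea: sort 25 million 32-byte
 * values in place, find duplicates with one linear walk, and answer
 * membership queries with bsearch().  load_keys() is a placeholder for
 * however the values actually get into memory. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_BYTES 32

/* Lexicographic compare of two 32-byte keys. */
static int cmp_key(const void *a, const void *b)
{
    return memcmp(a, b, KEY_BYTES);
}

int main(void)
{
    size_t n = 25000000;                       /* 25M * 32 bytes ~= 763MB */
    uint8_t (*keys)[KEY_BYTES] = calloc(n, KEY_BYTES);
    if (!keys) { perror("calloc"); return 1; }

    /* load_keys(keys, n);   placeholder: fill the array from wherever */

    /* Sort in place; any collisions become adjacent entries. */
    qsort(keys, n, KEY_BYTES, cmp_key);

    /* A single walk over the sorted array finds every duplicate. */
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys[i - 1], keys[i], KEY_BYTES) == 0)
            printf("duplicate at index %zu\n", i);

    /* Membership test: ~25 comparisons via binary search. */
    uint8_t probe[KEY_BYTES] = {0};            /* stand-in for a real value */
    if (bsearch(probe, keys, n, KEY_BYTES, cmp_key))
        puts("probe found");

    free(keys);
    return 0;
}

If that turns out to be fast enough, I agree it's a lot less to get running and debugged than anything fancier.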
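And in case I end up trying the cuckoo hash, here's how I understand it, again untested. If I've got it right, one caveat is that with two hash functions and a single value per bucket (which is all I've sketched), a cuckoo table only sustains a load factor of around 50%, so actually running it at 74.5% would probably take buckets holding 2-4 values each:

/* Untested sketch of the cuckoo-hash idea: one table of 2^N_BITS
 * buckets, h1 built from the leading bytes of the value and h2 from the
 * trailing bytes (standing in for "high-order / low-order n bits",
 * since the values are already hashes).  Single-slot buckets only, so
 * see the load-factor caveat above; resizing on failure is left out. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_BYTES 32
#define N_BITS    25
#define N_BUCKETS (1u << N_BITS)
#define MAX_KICKS 500                 /* give up after this many evictions */

typedef struct { uint8_t key[KEY_BYTES]; uint8_t used; } bucket_t;

static bucket_t *table;

static uint32_t h1(const uint8_t *k)  /* N_BITS from the leading bytes  */
{
    uint32_t v;
    memcpy(&v, k, sizeof v);
    return v & (N_BUCKETS - 1);
}

static uint32_t h2(const uint8_t *k)  /* N_BITS from the trailing bytes */
{
    uint32_t v;
    memcpy(&v, k + KEY_BYTES - sizeof v, sizeof v);
    return v & (N_BUCKETS - 1);
}

static int cuckoo_lookup(const uint8_t *k)
{
    uint32_t a = h1(k), b = h2(k);
    return (table[a].used && memcmp(table[a].key, k, KEY_BYTES) == 0) ||
           (table[b].used && memcmp(table[b].key, k, KEY_BYTES) == 0);
}

/* Returns 1 if inserted, 0 if already present (a collision in the data
 * set), -1 if placement failed (a real version would resize or stash). */
static int cuckoo_insert(const uint8_t *k)
{
    uint8_t cur[KEY_BYTES], tmp[KEY_BYTES];
    uint32_t slot;

    if (cuckoo_lookup(k))
        return 0;

    memcpy(cur, k, KEY_BYTES);
    slot = h1(cur);
    for (int i = 0; i < MAX_KICKS; i++) {
        if (!table[slot].used) {
            memcpy(table[slot].key, cur, KEY_BYTES);
            table[slot].used = 1;
            return 1;
        }
        /* Kick the occupant out and send it to its alternate bucket. */
        memcpy(tmp, table[slot].key, KEY_BYTES);
        memcpy(table[slot].key, cur, KEY_BYTES);
        memcpy(cur, tmp, KEY_BYTES);
        slot = (slot == h1(cur)) ? h2(cur) : h1(cur);
    }
    return -1;
}

int main(void)
{
    table = calloc(N_BUCKETS, sizeof *table);  /* ~1.1GB of buckets */
    if (!table) { perror("calloc"); return 1; }

    uint8_t k[KEY_BYTES] = { 0xde, 0xad, 0xbe, 0xef };  /* dummy value */
    printf("insert: %d  lookup: %d\n", cuckoo_insert(k), cuckoo_lookup(k));

    free(table);
    return 0;
}

Even this crude version should at least show whether the two bit-slice hash functions behave well enough on this data.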
