Most 7200 RPM disks will start to slow down once you get past about 4000 files in a single directory. That's not a filesystem issue, it's a drive speed issue.
Most C-written programs will outperform even the best-written Java programs, so the POSIX standard utilities (sed, awk, grep) are going to be helpful, and they're largely up to this task if you're comfortable enough scripting with them. Still, I'm going to have to go with the suggestions above: stick this in a proper database system, or use Perl/Ruby/Python and stick it in a hash table or Python dictionary object.

Converting a 1 GB file into a Python dictionary will likely take about 5-10 minutes of upfront cost, plus a fair bit of time (around 5-10 seconds) to execute the import statement in the interpreter; after that, anything that iterates over the whole thing will run in the vicinity of one second. There are also built-in search functions for Python dictionaries that are pretty quick (they're written in C).

For most machines (we're talking less than 3 years old), running a search on a hash table of 24 million entries will be pretty much RAM-bandwidth bound, and DDR3 RAM clocked at 1800 MHz has an effective throughput of 7-9 GB/s.

On Fri, Dec 27, 2013 at 10:06 AM, S. Dale Morrey <[email protected]> wrote:

> For some reason, god help me, but this is starting to feel like a job for
> perl unless I can find something more sensical.
> I tried to write a Java app and the only solution that didn't run out of
> memory was to search it using a scanner and go line by line.
> scanner.findWithinHorizon just puked after a few seconds.
>
> Search time for a single string near the end of the list was 65035 ms.
>
> Compare that to grep -F for the same string which seems to come in at 0.5s
> and appears to get faster the more often I grep the file (no idea why that
> would be, I'm using btrfs for my filesystem though).
>
> Still that's too dang slow. Just 1000 hashes would take 16 minutes to
> check. I'm expecting to generate 1000 hashes per second.
>
> I do wonder what would happen if I just "touched" a file for each entry and
> used the filesystem tools to see if a file by that name exists.
> What exactly happens if you have 24 million 0 byte files in a single
> directory on a btrfs filesystem?
>
> On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <[email protected]> wrote:
>
> > MySQL in memory table with full indexes is probably about as fast as you
> > can get.
> > On Dec 27, 2013 9:16 AM, "Lonnie Olson" <[email protected]> wrote:
> >
> > > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <[email protected]>
> > > wrote:
> > > > Just wondering, what would be the fastest way to do this?
> > >
> > > grep will have to scan the entire file every time. Not a good idea.
> > > You either need to store it all in memory, or use some kind of index.
> > >
> > > Memory options: Memcached, Redis, Custom data structure (PHP array,
> > > Ruby Hash, Python dictionary, etc)
> > > Indexed options: Postgres, MySQL, SQLite, BDB, etc.

--
Todd Millecam

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
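To make the hash-table suggestion above concrete, here is a minimal sketch, assuming the 24 million hashes sit one per line in a plain text file; the name hashes.txt is made up for the example, and a Python set is used because it's the same hash table as a dict, just without values:

    import time

    def load_hashes(path):
        # One hash per line; strip the newline and build a set for
        # constant-time (average) membership checks.
        with open(path) as f:
            return set(line.strip() for line in f)

    start = time.time()
    known = load_hashes("hashes.txt")   # hypothetical file of ~24M hashes
    print("loaded %d entries in %.1f s" % (len(known), time.time() - start))

    # Each check is a single hash-table probe, not a scan of the file.
    candidate = "5d41402abc4b2a76b9719d911017c592"   # example value only
    print(candidate in known)

The upfront load is the expensive part; once the set is in RAM, each lookup costs the same no matter where the string sits in the original file, which is exactly what kills the grep approach.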
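And a sketch of the indexed-database route the earlier replies suggest, using SQLite via Python's sqlite3 module as a stand-in for the MySQL/Postgres options; the file names hashes.txt and hashes.db are again hypothetical:

    import sqlite3

    def build_db(text_path, db_path):
        # One-time cost: load the hashes into a table whose PRIMARY KEY
        # gives a B-tree index on the hash column.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS hashes (h TEXT PRIMARY KEY)")
        with open(text_path) as f:
            con.executemany("INSERT OR IGNORE INTO hashes VALUES (?)",
                            ((line.strip(),) for line in f))
        con.commit()
        return con

    def seen(con, candidate):
        # Index seek on the primary key, not a full-table scan.
        row = con.execute("SELECT 1 FROM hashes WHERE h = ?",
                          (candidate,)).fetchone()
        return row is not None

    con = build_db("hashes.txt", "hashes.db")
    print(seen(con, "5d41402abc4b2a76b9719d911017c592"))

The trade-off versus the in-memory set is that the index lives on disk, so it survives restarts and doesn't need to fit in RAM, at the cost of a slower (but still non-scanning) lookup.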
