Sorting it was based on the premise that I could load it into a hashmap or dictionary-type structure. Unfortunately, I keep running out of RAM with every Java approach I can think of for storing this thing. My first thought was a HashSet, but that didn't work at all; it exhausted memory too.
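One thing I haven't actually tried yet is using the sort itself: binary searching the file directly on disk, without loading anything into memory. Here's a rough, untested sketch. It assumes the file was sorted in plain byte order (LC_ALL=C), that every line is a fixed-width hex hash (64 characters plus a newline here, which is a guess), and "hashes.txt" is a placeholder name:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    // Sketch: binary search a sorted file of fixed-width hash lines in place.
    // Assumes byte-order (LC_ALL=C) sorting and that every record is exactly
    // HASH_LEN hex characters plus '\n'. HASH_LEN = 64 is a placeholder.
    public class SortedFileSearch {
        static final int HASH_LEN = 64;
        static final int RECORD_LEN = HASH_LEN + 1; // hash + newline

        static boolean contains(RandomAccessFile f, String target)
                throws IOException {
            byte[] buf = new byte[HASH_LEN];
            long lo = 0;
            long hi = f.length() / RECORD_LEN - 1;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                f.seek(mid * RECORD_LEN);   // jump straight to record #mid
                f.readFully(buf);
                int cmp = new String(buf, StandardCharsets.US_ASCII)
                        .compareTo(target);
                if (cmp == 0) return true;  // found it
                if (cmp < 0) lo = mid + 1;  // target sorts after this record
                else hi = mid - 1;          // target sorts before this record
            }
            return false;
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile("hashes.txt", "r")) {
                System.out.println(contains(f, args[0]));
            }
        }
    }

With 24 million records that's only about 25 seeks per lookup, so even on spinning disk it ought to beat a linear scan by a wide margin.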
Splitting it out and recreating it as a bunch of files just gives me a hack where I can check whether a file exists. On a RAM disk such as /tmp, I'm guessing that would no longer be bound to the spin rate of the disk. I have 4GB of RAM, and my OS is using 1.5GB of it; I'm hoping I won't suddenly starve the system of RAM by dumping the contents to /tmp.

On Fri, Dec 27, 2013 at 10:21 AM, John Shaver <[email protected]> wrote:

> What was the point of sorting it if you're not going to use something like
> a binary search?
>
> If you're going to use files, split them by the first, then first and
> second, then first, second, and third character of the hash, and save them
> into a corresponding tree of directories.
>
> -John
>
> On Fri, Dec 27, 2013 at 10:06 AM, S. Dale Morrey <[email protected]> wrote:
>
> > For some reason, god help me, this is starting to feel like a job for
> > perl unless I can find something more sensical.
> > I tried to write a Java app, and the only solution that didn't run out
> > of memory was to search the file with a Scanner, line by line.
> > scanner.findWithinHorizon just puked after a few seconds.
> >
> > Search time for a single string near the end of the list was 65035 ms.
> >
> > Compare that to grep -F for the same string, which seems to come in at
> > 0.5s and appears to get faster the more often I grep the file (no idea
> > why that would be; I'm using btrfs for my filesystem, though).
> >
> > Still, that's too dang slow. Just 1000 hashes would take 16 minutes to
> > check, and I'm expecting to generate 1000 hashes per second.
> >
> > I do wonder what would happen if I just "touched" a file for each entry
> > and used the filesystem tools to see whether a file by that name exists.
> > What exactly happens if you have 24 million 0-byte files in a single
> > directory on a btrfs filesystem?
> >
> > On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <[email protected]> wrote:
> >
> > > A MySQL in-memory table with full indexes is probably about as fast as
> > > you can get.
> > >
> > > On Dec 27, 2013 9:16 AM, "Lonnie Olson" <[email protected]> wrote:
> > >
> > > > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <[email protected]> wrote:
> > > > > Just wondering, what would be the fastest way to do this?
> > > >
> > > > grep will have to scan the entire file every time. Not a good idea.
> > > > You either need to store it all in memory, or use some kind of index.
> > > >
> > > > Memory options: Memcached, Redis, a custom data structure (PHP array,
> > > > Ruby hash, Python dictionary, etc.)
> > > > Indexed options: Postgres, MySQL, SQLite, BDB, etc.
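Incidentally, here's roughly what I take John's directory-tree suggestion to mean, sketched in Java. The three-level fan-out, the zero-byte marker files, and the root path are my own placeholder choices, not anything he specified:

    import java.io.File;
    import java.io.IOException;

    // Sketch: fan zero-byte marker files out across nested directories keyed
    // on the first three characters of each hash, so no single directory
    // ends up holding 24 million entries. Depth of 3 is a placeholder.
    public class HashDirIndex {
        private final File root;

        public HashDirIndex(File root) {
            this.root = root;
        }

        // e.g. "deadbeef..." -> <root>/d/e/a/deadbeef...
        private File fileFor(String hash) {
            File dir = new File(root, hash.substring(0, 1));
            dir = new File(dir, hash.substring(1, 2));
            dir = new File(dir, hash.substring(2, 3));
            return new File(dir, hash);
        }

        public void add(String hash) throws IOException {
            File f = fileFor(hash);
            f.getParentFile().mkdirs(); // create the d/e/a/ chain as needed
            f.createNewFile();          // "touch" the zero-byte marker file
        }

        public boolean contains(String hash) {
            return fileFor(hash).isFile(); // lookup is a single stat()
        }
    }

With 16 hex characters per level, three levels give 4096 leaf directories, roughly 6,000 files each for 24 million hashes, instead of all of them piled into one directory.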
