Re: 24 Million entries and I need to what?

S. Dale Morrey Fri, 27 Dec 2013 09:21:22 -0800

What about if I did something like this?

mkdir -p /tmp/hashes
while IFS= read -r FILENAME; do touch "/tmp/hashes/$FILENAME"; done < hashes.txt
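
For comparison with the one-file-per-hash idea, the in-memory lookup suggested in the quoted replies below (a Python dictionary or similar hash table) might look roughly like this. It is only a sketch: the function names load_hashes and is_known are made up for illustration, the file is assumed to hold one hash per line, and the demo writes a tiny stand-in file instead of the real hashes.txt.

```python
import os
import tempfile


def load_hashes(path):
    """Read one hash per line into a set, skipping blank lines."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return {line.strip() for line in handle if line.strip()}


def is_known(hashes, candidate):
    """Membership test on the in-memory set (a hash-table lookup)."""
    return candidate in hashes


if __name__ == "__main__":
    # Tiny stand-in for hashes.txt so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("aaa111\nbbb222\nccc333\n")
    known = load_hashes(tmp.name)
    os.unlink(tmp.name)

    print(is_known(known, "bbb222"))  # present in the stand-in file
    print(is_known(known, "zzz999"))  # not present
```

A set is used rather than a dictionary because only membership matters here. Memory use grows with the number and length of the entries, so it would be worth measuring on the real 24 million line file before committing to this.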



On Fri, Dec 27, 2013 at 10:17 AM, Todd Millecam <[email protected]> wrote:

> Most disks (7200 RPM) will perform slowly after about 4000 files in a single
> directory.  That's not a filesystem issue, it's a drive speed issue.
>
> Most programs written in C will outperform even the best-written Java
> programs, so the standard POSIX utilities (sed, awk, grep) are going to be
> helpful and largely up to this task if you're comfortable enough to script
> with them.
>
> I'm gonna have to go with the suggestions above: stick this in a proper
> database system, or use Perl/Ruby/Python and stick it in a hash table or
> Python dictionary object.
>
> Converting a gigabyte file into a Python dictionary will likely take about
> 5-10 minutes of upfront cost and a fair bit of time (thinking about 5-10
> seconds) to execute the import statement in the interpreter.  After that,
> anything that iterates over the whole thing will run in the vicinity of one
> second.  There are also built-in search functions for Python dictionaries
> that are pretty quick (as they're written in C).
>
> For most machines (we're talking less than 3 years old), running a search on
> a hash table of 24 million entries will be pretty much RAM-bandwidth bound,
> and DDR3 RAM clocked at 1800 MHz has an effective throughput of 7-9 GB/s.
>
>
>
>
> On Fri, Dec 27, 2013 at 10:06 AM, S. Dale Morrey <[email protected]> wrote:
>
> > For some reason, god help me, this is starting to feel like a job for
> > Perl unless I can find something more sensible.
> > I tried to write a Java app, and the only solution that didn't run out of
> > memory was to search the file with a Scanner, going line by line.
> > scanner.findWithinHorizon just puked after a few seconds.
> >
> > Search time for a single string near the end of the list was 65035 ms.
> >
> > Compare that to grep -F for the same string, which seems to come in at
> > 0.5s and appears to get faster the more often I grep the file (no idea why
> > that would be; I'm using btrfs for my filesystem, though).
> >
> > Still, that's too dang slow.  At 0.5s each, just 1000 hashes would take
> > more than 8 minutes to check.  I'm expecting to generate 1000 hashes per
> > second.
> >
> > I do wonder what would happen if I just "touched" a file for each entry
> > and used the filesystem tools to see if a file by that name exists.
> > What exactly happens if you have 24 million 0 byte files in a single
> > directory on a btrfs filesystem?
> >
> >
> >
> >
> > On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <[email protected]> wrote:
> >
> > > MySQL in memory table with full indexes is probably about as fast as
> > > you can get.
> > > On Dec 27, 2013 9:16 AM, "Lonnie Olson" <[email protected]> wrote:
> > >
> > > > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <[email protected]> wrote:
> > > > > Just wondering, what would be the fastest way to do this?
> > > >
> > > > grep will have to scan the entire file every time.  Not a good idea.
> > > > You either need to store it all in memory, or use some kind of index.
> > > >
> > > > Memory options: Memcached, Redis, Custom data structure (PHP array,
> > > > Ruby Hash, Python dictionary, etc)
> > > > Indexed options: Postgres, MySQL, SQLite, BDB, etc.
> > > >
> > > > /*
> > > > PLUG: http://plug.org, #utah on irc.freenode.net
> > > > Unsubscribe: http://plug.org/mailman/options/plug
> > > > Don't fear the penguin.
> > > > */
> > > >
> > >
> >
>
>
>
> --
> Todd Millecam
>
