You're absolutely correct, of course, in what you say: Google doesn't
access millions of files in a split second. But Google does find one
file among millions in a split second.
I still think, though, that indexing makes sense on a PC - with
practically any underlying file system. More than ten years ago I used
a large (for that time) database of about 200 million files, which was
CD-ROM based; the file system was PC-DOS and the indexing was done on
top of that in a sort of front end. Finding/accessing any one of the
200 million files on the slow CD-ROM never took longer than 0.8
seconds. So we didn't touch the "kernel" of DOS at all and still got
fast results. I think a similar approach should work on a Linux system.
In my opinion this could and should be done outside of the kernel.
What do you think?
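To make the idea concrete, here is a rough sketch in Python of the kind
of front end I mean - a persistent index consulted instead of the
directory tree at lookup time. The index file name, the key scheme and
the data root are made-up illustrations, not the actual front end we
built back then:

# Toy sketch of a userspace index layered on top of any ordinary
# file system: a persistent key -> path map, consulted instead of
# walking directories at lookup time.  Names and paths are invented
# for illustration only.
import dbm
import os

INDEX_FILE = "file_index.db"   # hypothetical index location
DATA_ROOT = "/data/archive"    # hypothetical data tree

def build_index(root=DATA_ROOT, index_file=INDEX_FILE):
    """Walk the tree once and record where every file lives."""
    with dbm.open(index_file, "c") as idx:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                # The bare file name serves as the key here; a real
                # front end would use whatever identifier the
                # application already knows.
                idx[name] = os.path.join(dirpath, name)

def lookup(key, index_file=INDEX_FILE):
    """One hashed lookup in the index, no directory traversal."""
    with dbm.open(index_file, "r") as idx:
        path = idx.get(key)
        return path.decode() if path is not None else None

# Usage: run build_index() once up front, then call e.g.
# lookup("somefile.dat") per query.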
Paul
Robert G. Brown wrote:
On Sun, 22 Jan 2006, PS wrote:
Indexing is the key; observe how Google accesses millions of files
in split seconds; this could easily be achieved in a PC file system.
I think that you mean the right thing, but you're saying it in a very
confusing way.
1) Google doesn't access millions of files in a split second; AFAIK it
accesses relatively few hash/index files (on its "index servers") that
lead to URLs in a split second WITHOUT actually traversing millions
of alternatives (as you say, indexing is the key:-). File access
latency on a physical disk makes touching millions of files all but
impossible without highly specialized kernel hacks/hooks, ramdisks,
caches, disk arrays,
and so on. Even bandwidth would be a limitation if one assumes block
I/O with a minimum block size of 4K -- 4K x 1M -> 4 Gigabytes/second
(note BYTES, not bits) exceeds the bandwidth of pretty much any physical
medium except maybe memory.
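To spell out the back-of-the-envelope arithmetic (the one-second window
is a generous reading of "a split second", and the 5 ms seek figure
below is just a rough order-of-magnitude assumption):

# Back-of-envelope arithmetic for the claim above: reading one million
# files at one 4K block apiece within roughly one second.
block_size_bytes = 4 * 1024   # minimum block I/O, as assumed above
files = 1_000_000             # "millions of files"
window_seconds = 1.0          # "a split second", rounded up generously

required_bandwidth = block_size_bytes * files / window_seconds
print(f"{required_bandwidth / 1e9:.1f} GB/s")  # ~4.1 GB/s, BYTES not bits

# And latency is far worse: even a few milliseconds of seek per file
# (an assumed, order-of-magnitude figure) adds up to over an hour.
seek_latency_s = 5e-3
print(f"{files * seek_latency_s:.0f} s of pure seek time")   # ~5000 s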
2) It cannot "easily" be achieved in a PC file system, if by that you
mean building an actual filesystem (at the kernel level) that supports
this sort of access. There is a lot more to a scalable, robust,
journaling filesystem than directory lookup capabilities. A lot of
Google's speed comes from being able to use substantial parallelism on
a distributed server environment with lots of data replication and
redundancy, something that is impossible for a PC filesystem, which
has latency and bandwidth bottlenecks at various points along the
dataflow pathway toward what is typically a single physical disk on a
single (e.g. PCI-whatever) channel.
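Schematically, the kind of thing a distributed server farm buys you is
the ability to fan a single lookup out over many replicated index
shards at once - something a single PC disk simply can't do. A toy
Python illustration, with the shards and their contents entirely made
up (in-memory dicts standing in for remote index servers):

# Toy illustration of fanning one lookup out across replicated index
# shards in parallel.  The shards here are just in-memory dicts; a
# real system would be querying many remote machines.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each holding part of a key -> URL index.
shards = [
    {"beowulf": "http://www.beowulf.org/"},
    {"xfs": "http://example.org/xfs"},   # made-up entry
    {"mpi": "http://example.org/mpi"},   # made-up entry
]

def query_shard(shard, key):
    """Look the key up in one shard; None if it isn't there."""
    return shard.get(key)

def parallel_lookup(key):
    """Ask every shard at once and keep the first real answer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(lambda s: query_shard(s, key), shards))
    return next((r for r in results if r is not None), None)

print(parallel_lookup("beowulf"))   # -> http://www.beowulf.org/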
I think that what you mean (correctly) is that this is something that
"most" user/programmers would be better off trying to do in userspace on
top of any general purpose, known reliable/robust/efficient PC
filesystem, using hashes customized to the application. When I first
read your reply, though, I read it very differently, as saying that it
would be easy to build a Linux filesystem that actually permits millions
of files per second to be accessed and that this is what Google does
operationally.
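Something along the following lines is what I have in mind by "hashes
customized to the application" - a sketch only, with the key scheme,
the two-level directory fan-out and the storage root invented for the
example:

# Sketch of application-level hashing on top of an ordinary
# filesystem: the hash of the application's key determines where the
# file lives, so lookups go straight to a path instead of scanning
# huge directories.  The fan-out and root directory are arbitrary.
import hashlib
import os

DATA_ROOT = "hashed_store"   # hypothetical storage root

def path_for(key: str) -> str:
    """Map an application key to a bucketed path like ab/cd/<digest>."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return os.path.join(DATA_ROOT, digest[:2], digest[2:4], digest)

def store(key: str, data: bytes) -> None:
    path = path_for(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

def fetch(key: str) -> bytes:
    with open(path_for(key), "rb") as f:   # direct open, no scanning
        return f.read()

store("some-record-id", b"payload")
print(fetch("some-record-id"))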
rgb
Paul
Joe Landman wrote:
Methinks I lost lots of folks with my points ...
Major thesis is that on well designed hardware/software/filesystems,
50000 files is not a problem for access (though from a management
point of view it is a nightmare). For a poorly designed/implemented
file system, access is a nightmare too.
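It's easy enough to convince yourself one way or the other with a
throwaway test like the one below - the file count, the names and the
directory are arbitrary, and the numbers will of course depend entirely
on the filesystem underneath:

# Throwaway test: drop N small files into one directory, then time
# opening a few thousand of them at random.  N, the names and the
# directory are arbitrary; results depend on the underlying filesystem.
import os
import random
import time

N = 50_000
TESTDIR = "many_files_test"   # created in the current directory

os.makedirs(TESTDIR, exist_ok=True)
for i in range(N):
    with open(os.path.join(TESTDIR, f"file_{i:06d}"), "w") as f:
        f.write("x")

samples = random.sample(range(N), 5_000)
start = time.perf_counter()
for i in samples:
    with open(os.path.join(TESTDIR, f"file_{i:06d}")) as f:
        f.read()
elapsed = time.perf_counter() - start
print(f"{len(samples)} random opens in {elapsed:.3f} s "
      f"({elapsed / len(samples) * 1e6:.1f} us each)")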
Way back when in the glory days of SGI, I seem to remember xfs being
tested with millions of files per directory (though don't hold me to
that old memory). Call this hearsay at this moment.
A well designed and implemented file system shouldn't bog you down
as you scale out in size, even if perhaps you shouldn't scale out
that far. It's sort of like your car: if you go beyond 70 MPH
somewhere in the US where such speeds are allowed, your transmission
shouldn't just drop out because you hit 71 MPH.
Graceful degradation is a good thing.
Joe