Rick Peralta wrote:
Hi All,

Thanks for inviting me to the forum and thanks to you all for making things 
happen!

My father said, "don't change anything unless you know why."  Those words ring 
in my ears more and more after decades of systems development.  It is my intention and 
hope to respect the wisdom of those words and to be clear about what the objectives of any 
endeavor are (including sloth ;^).

Yes, that is pretty much the Linux mantra :)

Well, that, along with "do what you must, and no more" (implying: don't try to predict the future, don't over-design).


The chunkd effort caught my eye for a variety of reasons.  It is functionally 
very much like something I advocated long ago; it is a relatively 
simple yet powerful machine, and it may benefit from some redesign for 
performance (my personal specialty).

The question at hand is: what truly needs to be done?  Bugs are bugs and one 
can debate one solution over another, but in the end it's about getting things 
to work well.  Multithreading the transport layer is probably a good idea, but 
some diligence should be paid to the question of why.  Any number of other open 
issues also deserve attention.  Coding is fine, but understanding 
what and why seems like the necessary first step.

The current (version 1.0) design goals for chunkd are:

* multiple worker threads, because I/O parallelism

        - is the only way to max out storage hardware command and
          completion queues
        - enables greater optimizations on a TCQ/NCQ-enabled storage
          device, compared to slower command-at-a-time solutions

  (a minimal sketch of this worker model follows the list)

* no internal data caching; leverage the kernel pagecache

* use the POSIX filesystem API for our "database"; avoid SQL, db4, SQLite, etc.
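
To make the first goal concrete, here is a minimal worker-pool sketch.  This is illustrative only, with names and structure of my own invention, not chunkd source: a fixed pool of threads drains a shared queue of I/O requests, so several commands are outstanding against the device at once and the TCQ/NCQ queues can actually fill and reorder.

/*
 * Illustrative only -- not chunkd source.  A fixed pool of worker
 * threads drains a shared queue of I/O requests.  With N workers, up
 * to N commands are in flight at the device simultaneously, which is
 * what lets TCQ/NCQ command queues fill and reorder.
 */
#include <pthread.h>
#include <stdlib.h>

struct io_req {
        struct io_req *next;
        void (*fn)(struct io_req *);    /* the actual read/write work */
};

static struct io_req *queue_head;       /* LIFO; fine for a sketch */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
        struct io_req *req;

        (void) arg;
        for (;;) {
                pthread_mutex_lock(&queue_lock);
                while (!queue_head)
                        pthread_cond_wait(&queue_cond, &queue_lock);
                req = queue_head;
                queue_head = req->next;
                pthread_mutex_unlock(&queue_lock);

                req->fn(req);           /* blocking I/O happens here */
                free(req);
        }
        return NULL;
}

static void submit(struct io_req *req)
{
        pthread_mutex_lock(&queue_lock);
        req->next = queue_head;
        queue_head = req;
        pthread_mutex_unlock(&queue_lock);
        pthread_cond_signal(&queue_cond);
}

int main(void)
{
        pthread_t tid[4];               /* e.g. four workers */
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&tid[i], NULL, worker, NULL);

        /* ... accept connections, build io_reqs, submit() them ... */
        (void) submit;                  /* usage elided in this sketch */
        pthread_exit(NULL);             /* main exits; workers keep running */
}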

But I am very open to other design requirements or suggestions. Speak up! :)


In order to have a common basis for evaluation, I'd like to suggest a standard 
platform to consider in the context of these discussions: the current implementation of 
chunkd, running on a standard server (probably with a 32-bit address space), with 
gigabit Ethernet and a single disk (good for about 25 MB/s and a 15 ms seek time). 
More or different bulk storage, 10 GbE, InfiniBand, or other high-bandwidth 
configurations can be treated as branches from the core 
model.
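
A quick back-of-envelope check on that disk, assuming (purely for illustration; nothing in the design fixes this) a 64 KB average chunk and fully random access:

        per-chunk cost   = 15 ms seek + (64 KB / 25 MB/s) transfer
                         = 15 ms + 2.5 ms = 17.5 ms
        random ops/sec   ~ 1 / 17.5 ms  ~ 57
        random bandwidth ~ 57 * 64 KB   ~ 3.6 MB/s

So a seek-bound workload delivers a small fraction of the disk's 25 MB/s streaming rate, which is part of why keeping multiple commands in flight (so the drive can reorder them) matters.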

Yeah, the "standard platform model" is generally a "1U data center server", which probably equates to a physical or virtualized instance of: single multi-core CPU, 2-4GB RAM, gige, single ATA disk.

That example lends itself to 1000's of such chunkd storage nodes.

But it is also a valid minority model to have a handful of _huge_ chunkd nodes, perhaps tied to 10gige and SAN networks.


In its current implementation, chunkd generally resides in user space, 
over a standard file system (complete with caches, overhead and whatever else 
comes along).

Correct.
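
Since the "database" here is just the POSIX filesystem API (per the design goals above), a minimal sketch of the idea follows.  The function and layout (obj_put(), one file per object under a base directory) are hypothetical, not chunkd's actual code; the point is that there is no O_DIRECT and no user-space cache, so all reads and writes ride the kernel pagecache:

/*
 * Sketch only -- NOT chunkd's actual code.  Each object lives in its
 * own file under a base directory, named by its key.  No O_DIRECT,
 * no user-space cache: the kernel pagecache does the caching.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

static int obj_put(const char *basedir, const char *key,
                   const void *data, size_t len)
{
        char path[4096];
        ssize_t rc;
        int fd;

        snprintf(path, sizeof(path), "%s/%s", basedir, key);

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -errno;

        while (len > 0) {
                rc = write(fd, data, len);
                if (rc < 0) {
                        close(fd);
                        return -errno;
                }
                data = (const char *) data + rc;
                len -= rc;
        }

        if (fsync(fd) < 0) {            /* make the object durable */
                close(fd);
                return -errno;
        }
        return close(fd);
}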


PZ>
I have some short list todo for Chunk, after which I don't have
any particular plans:
 * Exit if CLD registration fails (maybe!).
 * Put ourhost into the CLD record, and the port.
 * Use base directory instead of Cell.
 * Switch to asprintf for CLD filenames, Geo.
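
(For context on that last item: asprintf() is the GNU extension that allocates the formatted string itself, so path construction needs no fixed-size buffer.  The filename layout below is made up for illustration, not chunkd's real scheme:)

#define _GNU_SOURCE             /* asprintf() is a GNU extension */
#include <stdio.h>
#include <stdlib.h>

/* Build a CLD filename of arbitrary length; the "/chunk-%s/%s"
 * layout is invented for illustration. */
static char *cld_filename(const char *cell, const char *host)
{
        char *fn;

        if (asprintf(&fn, "/chunk-%s/%s", cell, host) < 0)
                return NULL;
        return fn;              /* caller must free() */
}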

FD>
Yes. I also think that chunkd should not do its own replication, as the
strategy may be domain/application dependent. Therefore I'd appreciate it if
chunkd would provide some kind of "copy(dst,sha)" function, to be able
to copy directly to another chunkd instance.
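
Purely as a strawman of what that might look like (chunkd_copy() and every detail below are hypothetical, not an existing chunkd or libcldc API): the holding node streams one object, named by its SHA, straight to a peer, while the replication *policy* stays with the application:

/*
 * Strawman only -- chunkd_copy() and everything here are hypothetical,
 * not an existing chunkd API.  The holding node pushes one object,
 * identified by its SHA, directly to a peer node.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int chunkd_copy(const char *dst_ip, int dst_port,
                       const char *basedir, const char *sha_hex)
{
        char path[4096], buf[65536];
        struct sockaddr_in sa;
        ssize_t n = 0;
        int fd, sock, rc = 0;

        snprintf(path, sizeof(path), "%s/%s", basedir, sha_hex);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -errno;

        sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                close(fd);
                return -errno;
        }

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(dst_port);
        if (inet_pton(AF_INET, dst_ip, &sa.sin_addr) != 1) {
                rc = -EINVAL;
                goto out;
        }

        if (connect(sock, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                rc = -errno;
                goto out;
        }

        /* stream the object body; a real protocol would frame this
         * with the usual chunkd PUT request and checksums */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                if (write(sock, buf, n) != n) {
                        rc = -errno;
                        break;
                }
        if (n < 0)
                rc = -errno;
out:
        close(sock);
        close(fd);
        return rc;
}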

JG>
Hopefully all this is wrapped up into libcldc...

JG>
* total single-node volume size:  one cheap SATA hard drive
* total number of chunks:
        total number of tabled objects / number of storage nodes
* distribution of chunk sizes:  dependent upon the application using tabled
* aggregate bandwidth:  dependent upon the application using tabled

fbp>
Might we put some numbers to this?
Most notable are the typical chunk size and the number of supported clients.

You can make up some numbers, but chunk size and client count are two things that really will vary _wildly_ from application to application.

A distributed filesystem like Hadoop DFS / GoogleFS / CloudStore could have thousands of clients talking to a single chunkd node, because clients of those DFSes connect directly to the storage nodes.

NFS v4.1 also specifies a parallel storage model, where clients connect directly to the storage node storing the client's desired data.

In contrast, our current tabled design does not permit end-user clients to connect directly to chunkd storage. That implies hundreds or thousands of chunkd nodes, each with 0-5 actively connected clients.

Regards,

        Jeff


