A couple of thoughts:
Instead, I think we need the following two functions:
tOffset getBlockSize(dfsFs fs);
char** geHosts(dfsFs fs, char* file, tOffset pos);
** I think your suggestion is good. One addition: I'd rather not
assume that the block size is global. Why not require a file name in
the getBlockSize call? That may prove more future-proof.
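To make the idea concrete, here is a rough sketch of a per-file lookup that falls back to a filesystem-wide default. Everything in it is an assumption for illustration only: mockFs stands in for dfsFs, tOffset is assumed to be a 64-bit integer, and the name getBlockSizeForFile and the override table are hypothetical, not part of the draft spec.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef long long tOffset;            /* assumption: 64-bit offset type */

/* Hypothetical stand-in for dfsFs: only carries a default block size. */
typedef struct {
    tOffset defaultBlockSize;
} mockFs;

/* Hypothetical table of files whose block size differs from the default. */
typedef struct {
    const char *path;
    tOffset blockSize;
} blockSizeOverride;

static const blockSizeOverride kOverrides[] = {
    { "/logs/big.dat", 128LL * 1024 * 1024 },
};

/* Return the block size for one file, falling back to the fs default. */
tOffset getBlockSizeForFile(const mockFs *fs, const char *file)
{
    assert(fs != NULL && file != NULL);
    for (size_t i = 0; i < sizeof kOverrides / sizeof kOverrides[0]; i++) {
        if (strcmp(kOverrides[i].path, file) == 0)
            return kOverrides[i].blockSize;
    }
    return fs->defaultBlockSize;
}
```

With a signature like this, callers never depend on a single global value, so per-file block sizes could be introduced later without changing the API.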
** Why not wait on the dfsCopy/Move type commands? These can already
be implemented via a system call to a command-line tool, right?
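Roughly what I have in mind, as a sketch only. The tool name is passed in by the caller because I don't want to assume a particular command here; buildCopyCommand and copyViaTool are hypothetical helper names, and the quoting is deliberately naive (paths containing a single quote are rejected rather than escaped).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<tool> '<src>' '<dst>'" into buf.
 * Returns 0 on success, -1 if the result does not fit or if a path
 * contains a single quote (this sketch does not escape quotes). */
int buildCopyCommand(char *buf, size_t bufLen, const char *tool,
                     const char *src, const char *dst)
{
    assert(buf != NULL && tool != NULL && src != NULL && dst != NULL);
    if (strchr(src, '\'') != NULL || strchr(dst, '\'') != NULL)
        return -1;
    int n = snprintf(buf, bufLen, "%s '%s' '%s'", tool, src, dst);
    return (n < 0 || (size_t)n >= bufLen) ? -1 : 0;
}

/* Run the copy through the shell; returns 0 only if the tool exits with 0. */
int copyViaTool(const char *tool, const char *src, const char *dst)
{
    char cmd[4096];
    if (buildCopyCommand(cmd, sizeof cmd, tool, src, dst) != 0)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
}
```

The obvious costs are a process launch per call and error reporting limited to an exit status, which may be acceptable as a stopgap.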
On Apr 26, 2006, at 9:56 AM, Doug Cutting wrote:
Devaraj Das wrote:
Attached is a draft of the C API specification that some of us (in
Yahoo) have been thinking about. The specification is closely tied to
the API exported by Hadoop's FileSystem class. Will really appreciate
any comments, etc. on the specification.
Overall, this looks great! Thanks for working on this!
/**
 * dfsFileLocationInfo
 * used to get the mapping between file blocks and the hostnames where
 * they are stored. Due to replication, a file block could be stored on
 * multiple hosts.
 */
typedef struct {
char **hostname;
int numHosts;
} dfsFileLocationInfo;
/**
 * dfsStat
 * used for getting information about a file/directory
 */
typedef struct {
tObjectKind mKind; /** file or directory */
char *mName; /* the name of the file */
tTime mCreationTime;
dfsFileLocationInfo *fileLocationInfo; /* the last element in the array is NULL */
long mSize; /*the size of the file in bytes */
bool replicated; /*whether this file is replicated */
} dfsFileInfo;
/**
 * Return information about a path as a (dynamically allocated) array
 * of dfsFileInfo.
 * numEntries is set to the number of elements in the array.
 * If the path happens to be a file, the array will have just one element.
 * If the path happens to be a directory, the dfsFileInfo elements in the
 * array will contain information about the files/sub-dirs within the path.
 * NULL is returned if the path does not exist or some other error is
 * encountered. freeDfsFileInfo should be called passing the array and
 * numEntries when it is no longer needed.
 */
dfsFileInfo *dfsGetPathInfo(dfsFS fs, char *path, int *numEntries);
I'm a little confused about the dfsFileLocationInfo. It exposes
more of the filesystem internals than applications require. It's
also expensive to return full block lists with directory listings.
Instead, I think we need the following two functions:
tOffset getBlockSize(dfsFs fs);
char** geHosts(dfsFs fs, char* file, tOffset pos);
This would return an array of hosts that contain the specified
position in a file. Does that make sense?
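As a sketch of the intended semantics (not an implementation), a lookup by position amounts to dividing the offset by the block size and returning that block's host list. The mockFile struct, the host names, and mockGetHosts below are all made up for illustration; a real version would ask the filesystem rather than a static table.

```c
#include <assert.h>
#include <stddef.h>

typedef long long tOffset;            /* assumption: 64-bit offset type */

/* Hypothetical per-block host lists, each NULL-terminated. */
static const char *const kBlock0[] = { "host-a", "host-b", NULL };
static const char *const kBlock1[] = { "host-c", NULL };
static const char *const *const kBlocks[] = { kBlock0, kBlock1 };

/* Hypothetical in-memory description of one file. */
typedef struct {
    tOffset blockSize;                      /* bytes per block */
    tOffset length;                         /* file length in bytes */
    const char *const *const *blocks;       /* host list for each block */
} mockFile;

/* Return the NULL-terminated host list for the block containing pos,
 * or NULL if pos lies outside the file. */
const char *const *mockGetHosts(const mockFile *f, tOffset pos)
{
    assert(f != NULL && f->blockSize > 0);
    if (pos < 0 || pos >= f->length)
        return NULL;
    return f->blocks[pos / f->blockSize];
}
```

An application that wants to schedule work near the data only needs this one call per split, with no block list exposed in the directory listing.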
int dfsCopyFromLocalFile(dfsFs fs, char *src, char *dst);
int dfsCopyToLocalFile(dfsFs fs, char *src, char *dst);
int dfsMoveFromLocalFile(dfsFs fs, char *src, char *dst);
These are utility methods that could be implemented by user code,
i.e., they are not core methods. That's fine. But perhaps we should
add another:
int dfsCopy(dfsFs fs, char* src, char* dst);
Otherwise lots of applications will end up writing this themselves.
Thanks again,
Doug