A couple of thoughts:

Instead, I think we need the following two functions:

tOffset getBlockSize(dfsFs fs);

char** getHosts(dfsFs fs, char* file, tOffset pos);

** I think your suggestion is good. One addition: I'd rather not assume that block size is global. Why not require a file name in the getBlockSize call? This may prove more future-proof.
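
A rough sketch of what the per-file variant might look like. Only the extra `file` parameter is the actual suggestion; the typedefs, the stub body, the path "/big/file", and the sizes are placeholders for illustration and are not part of the draft spec.

```c
#include <string.h>

/* Placeholder typedefs: the draft spec defines the real ones. */
typedef long long tOffset;
typedef void *dfsFs;

/* Hypothetical per-file variant of getBlockSize.
 * The stub returns a fixed default, except for one made-up path,
 * only to show that the answer may differ from file to file. */
tOffset getBlockSize(dfsFs fs, char *file)
{
    (void)fs; /* unused in this stub */
    if (file != NULL && strcmp(file, "/big/file") == 0)
        return 128LL * 1024 * 1024;
    return 32LL * 1024 * 1024;
}
```

With this shape, callers never cache one filesystem-wide value, so a later move to per-file block sizes would not change the API.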

** Why not wait on the dfsCopy/Move type commands? These can already be implemented via a system call to a command-line tool, right?
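
For illustration, a minimal sketch of that approach using the standard C `system()` call. It assumes the tool is invoked as `hadoop dfs -copyFromLocal <src> <dst>`; the helper name `buildCopyFromLocalCmd` and the `dfsFs` typedef are made up here, and the quoting is naive (a path containing a single quote would break it).

```c
#include <stdio.h>
#include <stdlib.h>

typedef void *dfsFs; /* placeholder: the draft spec defines the real type */

/* Build the shell command for a local-to-DFS copy into buf.
 * Returns 0 on success, -1 if the command does not fit in buf.
 * Assumption: the tool is invoked as "hadoop dfs -copyFromLocal".
 * No escaping is done, so this is not safe for untrusted paths. */
int buildCopyFromLocalCmd(char *buf, size_t len, const char *src, const char *dst)
{
    int n = snprintf(buf, len, "hadoop dfs -copyFromLocal '%s' '%s'", src, dst);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Utility with the signature from the draft, implemented by shelling out.
 * Returns 0 if the tool exits with status 0, -1 otherwise. */
int dfsCopyFromLocalFile(dfsFs fs, char *src, char *dst)
{
    char cmd[4096];
    (void)fs; /* the command-line tool picks up its own configuration */
    if (buildCopyFromLocalCmd(cmd, sizeof cmd, src, dst) != 0)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
}
```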

On Apr 26, 2006, at 9:56 AM, Doug Cutting wrote:

Devaraj Das wrote:
Attached is a draft of the C API specification that some of us (in Yahoo) have been thinking about. The specification is closely tied to the API exported by Hadoop's FileSystem class. Will really appreciate any comments, etc. on the specification.

Overall, this looks great!  Thanks for working on this!

  /**
   * dfsFileLocationInfo
   * used to get the mapping between file blocks and the hostnames where
   * they are stored. Due to replication, a file block could be stored on
   * multiple hosts.
   */
  typedef struct  {
    char **hostname;
    int numHosts;
  } dfsFileLocationInfo;

  /**
   * dfsStat
   * used for getting information about a file/directory
   */
  typedef struct  {
    tObjectKind mKind;  /** file or directory */
    char *mName; /* the name of the file */
    tTime mCreationTime;
    dfsFileLocationInfo *fileLocationInfo; /* the last element in the array is NULL */
    long  mSize; /* the size of the file in bytes */
    bool replicated; /* whether this file is replicated */
  } dfsFileInfo;

  /**
   * return information about a path as a (dynamically allocated) array
   * of dfsFileInfo.
   * numEntries is set to the number of elements in the array.
   * If the path happens to be a file, the array will have just one element.
   * If the path happens to be a directory, the dfsFileInfo elements in the
   * array will contain information about the files/sub-dirs within the path.
   * NULL is returned if the path does not exist or some other error is
   * encountered. freeDfsFileInfo should be called passing the array and
   * numEntries when it is no longer needed.
   */
  dfsFileInfo *dfsGetPathInfo(dfsFS fs, char *path, int *numEntries);

I'm a little confused about the dfsFileLocationInfo. It exposes more of the filesystem internals than applications require. It's also expensive to return full block lists with directory listings.

Instead, I think we need the following two functions:

tOffset getBlockSize(dfsFs fs);

char** getHosts(dfsFs fs, char* file, tOffset pos);

This would return an array of hosts that contain the specified position in a file. Does that make sense?
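
To make the intent concrete, here is a small sketch of how the two calls could fit together: the offset is divided by the block size to find the block, and the hosts for that block are returned. Everything beyond the two signatures is an assumption for illustration: the typedefs, the stub bodies, the 64-byte block size, the host names, the NULL-terminated return convention, and the `countHosts` helper.

```c
#include <stddef.h>

/* Placeholder typedefs: the draft spec defines the real ones. */
typedef long long tOffset;
typedef void *dfsFs;

/* Stub: one block size for the whole filesystem (tiny, for the example). */
tOffset getBlockSize(dfsFs fs)
{
    (void)fs;
    return 64;
}

/* Made-up placement table for a two-block file.
 * Each host list is NULL-terminated (an assumed convention). */
static char *block0[] = { "hostA", "hostB", NULL };
static char *block1[] = { "hostB", "hostC", NULL };
static char **blocks[] = { block0, block1 };

/* Stub for the proposed call: returns the hosts holding the block that
 * contains pos, or NULL if pos falls outside the file. */
char **getHosts(dfsFs fs, char *file, tOffset pos)
{
    tOffset idx;
    (void)file; /* the stub knows about a single file only */
    if (pos < 0)
        return NULL;
    idx = pos / getBlockSize(fs); /* index of the block containing pos */
    if (idx >= (tOffset)(sizeof blocks / sizeof blocks[0]))
        return NULL;
    return blocks[idx];
}

/* Hypothetical helper: count the entries in a NULL-terminated host list. */
int countHosts(char **hosts)
{
    int n = 0;
    while (hosts != NULL && hosts[n] != NULL)
        n++;
    return n;
}
```

A caller that only needs locality information for one offset gets it from a single call, without walking a full per-file block list.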

  int dfsCopyFromLocalFile(dfsFs fs, char *src, char *dst);
  int dfsCopyToLocalFile(dfsFs fs, char *src, char *dst);
  int dfsMoveFromLocalFile(dfsFs fs, char *src, char *dst);

These are utility methods that could be implemented by user code, i.e., not core methods. That's fine. But perhaps we should add another:

int dfsCopy(dfsFs fs, char* src, char* dst);

Otherwise lots of applications will end up writing this themselves.

Thanks again,

Doug
