On 12 February 2013 22:09, Eli Collins <[email protected]> wrote:

> I agree that the current place isn't a good one, for both the reasons
> you mention on the jira (and because the people maintaining this code
> don't primarily work on Hadoop). IMO the SwiftFS driver should live in
> the swift source tree (as part of open stack).
If they could be persuaded to move beyond .py, it'd be tempting, because the FileSystem API is nominally stable. However, one thing I have noticed during this work is how underspecified the behaviour of FileSystem is. That's not an issue for HDFS, which gets stressed rigorously during the hdfs and mapred test runs, but it does matter for the rest. There are a lot of assumptions: files != directories; "mv / anything" fails; untested cases such as "mv self self" (returns true if self is a file, false if a directory); which exception to raise when readFully goes past the end of a file (and the answer is?). We even make an implicit assumption that file operations are consistent, that you get back what you wrote, which turns out to be an assumption not guaranteed by any of the blobstores in all circumstances.

HADOOP-9258 and HADOOP-9119 tighten the spec a bit, but if you look at what I've been doing for Swift testing, I've created a set of test suites, one per operation ("ls", "read", "rename", ...), with tests for scale and for directory depth and width on my todo list:

https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift

Then I want to extract those into tests that can be applied to all filesystems (say in o.a.h.fs.contract), with a per-FS metadata file providing details on what the FS supports (rename, append, case sensitivity, MAX_PATH, ...), so that we get test coverage that can be applied to every filesystem in the Hadoop codebase. Being JUnit 4, tests can be skipped in-code by throwing AssumptionViolatedExceptions, which get reported as skips rather than failures; there's a rough sketch of both the metadata file and a contract test at the end of this mail. It's this expanded test coverage that will be the tightest coupling to Hadoop.

> I'm not -1 on it living in-tree, it's just not my 1st choice. If you
> want to create a top-level directory for 3rd party (read non-local,
> non-hdfs file systems) file systems - go for it. It would be an
> improvement on the current situation (o.a.h.fs.ftp also brings in
> dependencies that most people don't need). I don't think we need to
> come up with a new top-level "kitchen sink" directory to handle all
> Hadoop extensions, there are a few well-defined extension points that
> can likely be handled independently so logically grouping them
> separately makes sense to me (and perhaps we'll decide some extensions
> are better in-tree and some not).

Makes sense. I'll do that in a JIRA.
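
To make the metadata idea concrete, here is roughly the kind of per-FS file I have in mind. Every key and filename below is invented for illustration; nothing like this exists in the tree yet:

    # localfs-contract.properties: hypothetical declaration of what the
    # local filesystem supports; one such file per FileSystem implementation.
    fs.contract.supports-rename = true
    fs.contract.supports-append = false
    fs.contract.is-case-sensitive = true
    fs.contract.max-path-length = 4096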
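
And a sketch of a contract test consuming it, showing the skip-via-assumption trick. Class and method names are placeholders, not code that exists anywhere yet:

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Assume;
    import org.junit.Test;

    import static org.junit.Assert.assertFalse;

    /**
     * Hypothetical shared contract test: each FileSystem implementation
     * subclasses it, supplying its own instance and its metadata file.
     */
    public abstract class AbstractRenameContractTest {

      /** The filesystem under test. */
      protected abstract FileSystem getTestFileSystem() throws IOException;

      /** The per-FS metadata, loaded as plain properties. */
      protected abstract Properties getContract();

      private boolean contractFlag(String key) {
        return Boolean.parseBoolean(getContract().getProperty(key, "false"));
      }

      @Test
      public void testRenameMissingSource() throws IOException {
        // assumeTrue() throws AssumptionViolatedException when the flag is
        // false; JUnit 4 reports that as a skipped test, not a failure.
        Assume.assumeTrue(contractFlag("fs.contract.supports-rename"));
        FileSystem fs = getTestFileSystem();
        // One of the underspecified corners: HDFS returns false when the
        // source doesn't exist; a shared test would pin that behaviour
        // down for everyone else too.
        assertFalse("rename of a missing source must not claim success",
            fs.rename(new Path("/test/missing"), new Path("/test/dest")));
      }
    }

The swift suite would then subclass this with something like a SwiftContractTest pointing at its own properties file, and the same for hdfs, local, ftp and s3.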
