I used the new DataStorage API that showed up as part of PIG-32 a fair bit this weekend while updating PIG-55, and I'm a little perplexed by its design. A few questions about why things are the way they are follow. I'd be happy to make some patches to address these issues, but I wanted to make sure I'm not missing something first.

Why are the navigation functions on DataStorage and not ContainerDescriptor? It seems natural to add a couple methods to ContainerDescriptor to get a subelement or subcontainer given a String. The current setup seems to require calling getDataStorage on the Container then calling asContainer or asElement on it with the same Container as an argument. If the navigation moves to ContainerDescriptor, DataStorage could just have a single method to get an ElementDescriptor given a String rather than its current proliferation of as* methods.

Why does ContainerDescriptor extend ElementDescriptor? ElementDescriptor exposes several methods that make no sense for a directory, so this forces every ContainerDescriptor implementation to disallow those methods and return a dummy InputStream for create. A common superinterface with the shared operations would make things much easier for DataStorage implementors.

Why does ContainerDescriptor only expose listing its subelements through being an Iterable? Having it be Iterable is definitely nice, but there are always times when you need to look at all the files at once, so this forces any client code to build an array by hand out of the Iterable. Since both of the existing implementations are already turning a returned array into an iterable, why not expose that and save some work for clients?

What's the distinction between getConfiguration and getStatistics? Is it that the things in Configuration are settable and Statistics aren't? If that's the case, why not just have a getProperties method and note if a given key is settable in its javadoc. A user is already going to have to lookup the key to figure out how a given DataStorage implementation's configuration maps into the common DataStorage operations.

Could keys common to all DataStorage implementations be moved to methods on ElementDescriptor? The existing keys all seem like they'd be available from any DataStorage, so making them regular methods on the ElementDescriptor would make them much more pleasant to use not to mention that it'd remove the need to create a Map for every access to these rather commonly used attributes.

Is toString the correct method to produce a String representation of a ElementDescriptor? I didn't see anything else on there to produce an absolute String representation of a path, and that's hugely useful for serialization. It seems like a bad idea to expose that through toString since toString is generally used for debugging, and there's nothing to guide DataStorage implementors to use toString as such.

The last thing that's bothering me about the API is the names of the interfaces: ElementDescriptor and ContainerDescriptor. Those names tell me almost nothing about what the interfaces do. Container gives me a little hint that that interface will probably have other things inside of it, but the other two words are generic enough in programming to be meaningless. I realize that something other than a filesystem may be exposed through these interfaces, but the operations exposed through the interfaces are inherently file-like, so calling them something like PigFile and PigDirectory would convey loads more information about how they're to be used to a programmer encountering them for the first time.

Charlie

Reply via email to