Honestly, if we just treat subregions as independent datasets without special attributes, things can get a little hairy. Right now subregions by themselves are defined as simple bounding boxes (or, in the future, more complex polygons). However, these are not tied to one particular dataset; they are instead applied to multiple datasets. The proposed SubregionDataset, on the other hand, must be tied to a parent Dataset instance, because subsetted data isn't very useful without the original context. I think this depends on what is meant by "grouping" here. But suppose, for example, we had some function in dataset_processor that spits out the subsetted datasets, and we then passed them in to the evaluation later as generic datasets. How would the user know which dataset the subsetted data originally belonged to without messing around with the name attribute?
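For what it's worth, here is a minimal sketch of what I mean by tying a SubregionDataset to its parent. The Dataset stand-in, the parent attribute, and the group_by_parent helper are all hypothetical names made up for illustration, not existing OCW code:

```python
class Dataset:
    """Hypothetical stand-in for the OCW Dataset class (attributes assumed)."""

    def __init__(self, lats, lons, values, name=""):
        self.lats = lats
        self.lons = lons
        self.values = values
        self.name = name


class SubregionDataset(Dataset):
    """Subsetted data that keeps a reference to the Dataset it was cut from."""

    def __init__(self, parent, subregion_name, lats, lons, values):
        super().__init__(lats, lons, values, name=subregion_name)
        self.parent = parent  # assumed attribute: the original, unsubsetted Dataset


def group_by_parent(datasets):
    """Map each parent dataset name to the names of its subregion datasets."""
    groups = {}
    for ds in datasets:
        if isinstance(ds, SubregionDataset):
            groups.setdefault(ds.parent.name, []).append(ds.name)
    return groups


# Usage: the parent name stays intact, underscores and all.
trmm = Dataset([0, 1, 2], [10, 11, 12],
               [[1, 2, 3], [4, 5, 6], [7, 8, 9]], name="TRMM_hourly_precip")
r01 = SubregionDataset(trmm, "R01", [0, 1], [10, 11], [[1, 2], [4, 5]])
groups = group_by_parent([trmm, r01])
```

With something like this the evaluation could group results by ds.parent instead of parsing dataset names.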
If we decide that a Dataset subclass for subregions is unnecessary and keep one generic Dataset class, allow me to propose the following modification.

From a computer science perspective, I think it might be most convenient to think of subregion datasets as a tree data structure, with the original dataset as the root node. This type of implementation would allow several levels of nesting (e.g. subregions of subregions), but for the sake of argument let's just say that our tree always has a depth of one, since I don't think any scientist would need more than that. This could be implemented by giving each Dataset a subregions attribute containing a dictionary keyed by subregion name, whose values are just new datasets (ideally making sure that each child's lats, lons, and values are views of the corresponding attributes in the root dataset and not deep copies). Then we could define our metrics so that they work recursively on each dataset, first running on the entire thing, then repeating on each child (subregion).

A potential disadvantage I see is figuring out the best way to structure the result of the metric. We would also need to decide how new subregions get added to the dataset: directly to the dictionary and/or through a constructor argument, or solely through dataset_processor.

Alternatively, it would still be fine to make a new subclass of Dataset and have the primary attributes be multidimensional arrays containing the subsetted data, but this might be less elegant since you would then need to process it differently in the evaluation.

Any more thoughts? The tree idea is something I came up with only recently, so I am not sure I have covered all the possible drawbacks.
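To make the tree idea a bit more concrete, here is a rough sketch. It assumes NumPy arrays, sorted 1-D lats and lons, a 2-D (lat, lon) values array, and a bounding box that overlaps the grid. The add_subregion method, the run_metric function, and the nested result layout are names and choices I made up for illustration, not existing OCW code:

```python
import numpy as np


class Dataset:
    """Hypothetical stand-in for the OCW Dataset class (attributes assumed)."""

    def __init__(self, lats, lons, values, name=""):
        self.lats = np.asarray(lats)
        self.lons = np.asarray(lons)
        self.values = np.asarray(values)  # assumed shape: (lat, lon)
        self.name = name
        self.subregions = {}  # child datasets keyed by subregion name

    def add_subregion(self, name, south, north, west, east):
        """Attach a child dataset covering the given bounding box."""
        lat_idx = np.flatnonzero((self.lats >= south) & (self.lats <= north))
        lon_idx = np.flatnonzero((self.lons >= west) & (self.lons <= east))
        lat_sl = slice(lat_idx[0], lat_idx[-1] + 1)
        lon_sl = slice(lon_idx[0], lon_idx[-1] + 1)
        # Basic slicing returns NumPy views, so the child shares the
        # parent's memory instead of holding a deep copy.
        child = Dataset(self.lats[lat_sl], self.lons[lon_sl],
                        self.values[lat_sl, lon_sl], name=name)
        self.subregions[name] = child
        return child


def run_metric(metric, dataset):
    """Apply metric to the dataset, then recursively to each subregion."""
    return {
        "result": metric(dataset),
        "subregions": {name: run_metric(metric, child)
                       for name, child in dataset.subregions.items()},
    }


# Usage: a 5x6 grid with one subregion, evaluated with a simple mean.
root = Dataset(np.arange(5), np.arange(10, 16),
               np.arange(30.0).reshape(5, 6), name="root")
child = root.add_subregion("R01", south=1, north=3, west=11, east=13)
out = run_metric(lambda d: float(d.values.mean()), root)
```

The nested dictionary is only one possible answer to the question of how to structure the metric result; a flat dictionary keyed by (dataset name, subregion name) would work just as well for a depth-one tree.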
Thanks,
Alex

On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:

> I think the most important thing that we need to decide on is:
>
> Do we treat subregions of a dataset as a single object? As in
>
> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions, aDataset)
>
> In this case, what would aSubregionedDataset look like? Would it have a
> list of lat lists? One list for each subregion that was taken from the
> dataset?
>
> Or is a subregion effectively just a dataset? As in
>
> >>> [firstDataset, secondDataset, ..., nthDataset] = DatasetProcessor.subregion(someNSubregions, aDataset)
>
> If we go with the second approach, I don't see the point in making a
> distinction. If a subregion is really just a subset of a dataset then
> there's no purpose in separating the two. The user should be responsible
> for properly grouping datasets (that happen to be subregions) into the
> Evaluation and passing them in the expected grouping for plotting. In this
> case, the Evaluation object doesn't treat a subregion differently at all.
> It's just a Dataset that gets run through everything like normal.
>
> If the only purpose for making a special distinction between a subregion
> and a dataset is for grouping convenience then we really need to ask
> ourselves if the user should be responsible for handling the grouping so we
> can simplify the system. Personally, I think the user should be responsible
> for this work. However, that's only because as far as I can tell a
> subregion is just a Dataset with an adjusted bounding box. Perhaps I'm
> oversimplifying.
>
> -- Joyce
>
> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]> wrote:
>
> > I think option 3 is the best given the rationale that has been stated
> > previously.
> >
> > I can add a function to the dataset_processor module that will take in a
> > single Dataset Object and a list of SubRegion Specifications (north, south,
> > east, west, Name), and it could return a tuple of SubRegion objects with a
> > length equal to the number of SubRegion Specs.
> >
> > SubClassing Dataset makes sense because a Dataset and SubRegion share
> > common attributes, but after talking with Mike about the two, can a future
> > science user please give me a clear difference between a Dataset and a
> > SubRegion?
> >
> > I hope a SubRegion assumes specific Metrics to be run, that cannot be run
> > on a Dataset. I fear if SubRegion and Dataset are too similar it will
> > merely confuse users (and software engineers) about when to use which one.
> >
> > Can anyone articulate the difference between a Dataset and SubRegion for
> > me?
> >
> > Thanks,
> >
> > Cameron
> >
> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]> wrote:
> >
> > > You covered most everything Alex.
> > >
> > > I'm a fan of inheriting from Dataset to handle Subregions. The user can
> > > still add the "dataset" the same way to an Evaluation. Then the Evaluation
> > > instance can run a separate eval loop to handle subregions. It makes
> > > Evaluation more complicated but using naming convention to designate a
> > > subregion will just be worse I feel. The DatasetProcessor could have a
> > > function that takes a Dataset and subregion information and spits out a new
> > > SubregionDataset (or some such meaningful name) instance that the user can
> > > add to the Evaluation.
> > >
> > > What does everyone think would be a good way of handling this?
> > >
> > > -- Joyce
> > >
> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander (398J-Affiliate) <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Being able to account for subregions will be a crucial part of running an
> > > > evaluation and making the right plots as part of our OCW refactoring. Mike
> > > > and I had a discussion last Friday on some ways to do this and we both
> > > > thought that the best approach would make use of the Dataset class somehow.
> > > > Some specific ideas we had include:
> > > >
> > > > 1) Designate datasets as subregional by convention. Specifically, this
> > > > could be something like making a new dataset instance with the same name as
> > > > the parent dataset but with the subregion name appended to the end with a
> > > > leading underscore (e.g. name_R01, name_R02).
> > > >
> > > > 2) Values for a particular subregion could be placed in a list or dictionary
> > > > as an attribute of Dataset.
> > > >
> > > > 3) Make a subclass of Dataset explicitly for subregions.
> > > >
> > > > In general, any approach will add an additional complication to some
> > > > component of the new OCW code in that the evaluation results / datasets
> > > > need to get grouped together by subregion.
> > > >
> > > > My preferred approach is (3) since it adds the least amount of complication
> > > > to the plotting. I particularly don't like (1) since enforcing a rule by
> > > > convention would add restrictions to users on valid names for datasets; for
> > > > example, a dataset name like 'TRMM_hourly_precip' would make it difficult to
> > > > incorporate subregions.
> > > >
> > > > Mike, my memory since our last meeting is a bit fuzzy so please clarify or
> > > > correct any of my points if I am wrong here. I would like to hear other
> > > > ideas or opinions as to the best approach for the subregion problem.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > --
> > > > Alex Goodman

--
Alex Goodman
