Re: OCW Refactoring and Subregions

Goodman, Alexander (398J-Affiliate) Tue, 30 Jul 2013 10:40:15 -0700

Honestly, if we just treat subregions as independent datasets without
special attributes, things can get a little hairy. Right now subregions by
themselves are defined as simple bounding boxes (or, in the future, more
complex polygons). However, these are not tied to one particular dataset;
they are instead applied to multiple datasets. The proposed
SubregionDataset, on the other hand, must be tied to a parent Dataset
instance, because subsetted data isn't very useful without the original
context. I think this depends on what is meant by "grouping" here. But
suppose, for example, that some function in dataset_processor spits out
the subsetted datasets and we then pass them in to the evaluation later as
generic datasets. How will the user know which dataset the subsetted data
originally belonged to without messing around with the name attribute?


If we decide that a Dataset subclass for subregions is irrelevant and keep
one generic Dataset class, allow me to propose the following modification:
From a computer science perspective, I think it might be most convenient to
think of subregion datasets as a tree data structure, with the original
dataset as the root node. This type of implementation would make it
possible to have several levels of complexity (e.g. subregions of
subregions), but for the sake of argument let's just say that our tree
always has a depth of one, since I don't think any scientist would need
more than that.
This could be implemented by giving each Dataset a subregions attribute
containing a dictionary of datasets keyed by subregion name, and of course
the values are just new datasets (ideally making sure that dataset lats,
lons, and values are views of their corresponding attributes in the root
dataset and not deep copies). Then we could define our metrics so that they
could work recursively on each dataset, first running on the entire thing,
then repeating on each child (subregion). One potential disadvantage I can
see is figuring out the best way to structure the result of the metric. We
would also need to decide how new subregions get added to the dataset:
directly to the dictionary and/or through an argument in the constructor,
or solely through dataset_processor.
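
To make the tree idea concrete, here is a minimal sketch in Python. Everything
in it is an assumption for the sake of illustration: the Dataset class below is
a hypothetical stand-in and not the actual OCW class, add_subregion and
run_metric are made-up names, and lats and lons are assumed to be sorted 1-D
arrays so that a bounding box maps to a contiguous slice. The nested dictionary
returned by run_metric is just one possible way to structure the result.

```python
import numpy as np


class Dataset:
    """Hypothetical stand-in for the OCW Dataset class (not the real API)."""

    def __init__(self, name, lats, lons, values, parent=None):
        self.name = name
        self.lats = lats        # 1-D array of latitudes (assumed sorted)
        self.lons = lons        # 1-D array of longitudes (assumed sorted)
        self.values = values    # 2-D array shaped (lat, lon)
        self.parent = parent    # None for the root dataset
        self.subregions = {}    # subregion name -> child Dataset

    def add_subregion(self, name, south, north, west, east):
        """Attach a child dataset covering the given bounding box."""
        lat_idx = np.flatnonzero((self.lats >= south) & (self.lats <= north))
        lon_idx = np.flatnonzero((self.lons >= west) & (self.lons <= east))
        if lat_idx.size == 0 or lon_idx.size == 0:
            raise ValueError("bounding box does not overlap the dataset")
        # Basic slices give NumPy views, so the child shares memory with
        # the root dataset instead of holding a deep copy.
        lat_sl = slice(lat_idx[0], lat_idx[-1] + 1)
        lon_sl = slice(lon_idx[0], lon_idx[-1] + 1)
        child = Dataset(name, self.lats[lat_sl], self.lons[lon_sl],
                        self.values[lat_sl, lon_sl], parent=self)
        self.subregions[name] = child
        return child


def run_metric(metric, dataset):
    """Run metric on the dataset, then recursively on each subregion."""
    return {
        "result": metric(dataset),
        "subregions": {
            name: run_metric(metric, child)
            for name, child in dataset.subregions.items()
        },
    }


# Usage: a 5 x 4 grid with one subregion named R01.
root = Dataset("example", np.arange(5.0), np.arange(4.0),
               np.arange(20.0).reshape(5, 4))
r01 = root.add_subregion("R01", south=1, north=3, west=0, east=1)
means = run_metric(lambda d: float(d.values.mean()), root)
```

Since each child keeps a reference to its parent, the question of which
dataset a subregion originally belonged to is answered without touching the
name attribute.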

Alternatively it would still be fine to make a new subclass of Dataset and
have the primary attributes be multidimensional arrays containing the
subsetted data, but doing this might be less elegant since you would then
need to process it differently in the evaluation.

Any more thoughts? The tree idea was merely something I just recently came
up with out of the blue, so I am not sure if I covered all the possible
drawbacks.

Thanks,
Alex


On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:

> I think the most important thing that we need to decide on is:
>
> Do we treat subregions of a dataset as a single object? As in
> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions,
> aDataset)
>
> In this case, what would aSubregionedDataset look like? Would it have a
> list of lat lists? One list for each subregion that was taken from the
> dataset?
>
> Or is a subregion effectively just a dataset? As in
>
> >>> [firstDataset, secondDataset, ..., nthDataset] =
> DatasetProcessor.subregion(someNSubregions, aDataset)
>
> If we go with the second approach, I don't see the point in making a
> distinction. If a subregion is really just a subset of a dataset then
> there's no purpose in separating the two. The user should be responsible
> for properly grouping datasets (that happen to be subregions) into the
> Evaluation and passing them in the expected grouping for plotting. In this
> case, the Evaluation object doesn't treat a subregion differently at all.
> It's just a Dataset that gets run through everything like normal.
>
> If the only purpose for making a special distinction between a subregion
> and a dataset is for grouping convenience then we really need to ask
> ourselves if the user should be responsible for handling the grouping so we
> can simplify the system. Personally, I think the user should be responsible
> for this work. However, that's only because as far as I can tell a
> subregion is just a Dataset with an adjusted bounding box. Perhaps I'm
> oversimplifying.
>
>
> -- Joyce
>
>
> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]>
> wrote:
>
> > I think option 3 is the best given the rationale that has been stated
> > previously.
> >
> > I can add a function to the dataset_processor module that will take in a
> > single Dataset Object and a list of SubRegion Specifications (north,
> > south, east, west, Name), and it could return a tuple of SubRegion
> > objects with a length equal to the number of SubRegion Specs.
> >
> > SubClassing Dataset makes sense because a Dataset and SubRegion share
> > common attributes, but after talking with Mike about the two, can a
> > future science user please give me a clear difference between a Dataset
> > and a SubRegion?
> >
> > I hope a SubRegion assumes specific Metrics to be run, that cannot be run
> > on a Dataset.  I fear if SubRegion and Dataset are too similar it will
> > merely confuse users (and software engineers) about when to use which
> > one.
> >
> > Can anyone articulate the difference between a Dataset and SubRegion for
> > me?
> >
> >
> > Thanks,
> >
> >
> > Cameron
> >
> >
> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]>
> > wrote:
> >
> > > You covered most everything Alex.
> > >
> > > I'm a fan of inheriting from Dataset to handle Subregions. The user can
> > > still add the "dataset" the same way to an Evaluation. Then the
> > > Evaluation instance can run a separate eval loop to handle subregions.
> > > It makes Evaluation more complicated but using naming convention to
> > > designate a subregion will just be worse I feel. The DatasetProcessor
> > > could have a function that takes a Dataset and subregion information
> > > and spits out a new SubregionDataset (or some such meaningful name)
> > > instance that the user can add to the Evaluation.
> > >
> > > What does everyone think would be a good way of handling this?
> > >
> > >
> > > -- Joyce
> > >
> > >
> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander (398J-Affiliate) <
> > > [email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Being able to account for subregions will be a crucial part of running
> > > > an evaluation and making the right plots as part of our OCW
> > > > refactoring. Mike and I had a discussion last Friday on some ways to
> > > > do this and we both thought that the best approach would make use of
> > > > the Dataset class somehow. Some specific ideas we had include:
> > > >
> > > > 1) Designate datasets as subregional by convention. Specifically, this
> > > > could be something like making a new dataset instance with the same
> > > > name as the parent dataset but with the subregion name appended to the
> > > > end with a leading underscore (e.g. name_R01, name_R02).
> > > >
> > > > 2) Values for a particular subregion could be placed in a list or
> > > > dictionary as an attribute of Dataset.
> > > >
> > > > 3) Make a subclass of Dataset explicitly for subregions.
> > > >
> > > > In general, any approach will add an additional complication to some
> > > > component of the new OCW code in that the evaluation results /
> > > > datasets need to get grouped together by subregion.
> > > >
> > > > My preferred approach is (3) since it adds the least amount of
> > > > complication to the plotting. I particularly don't like (1) since
> > > > enforcing a rule by convention would add restrictions to users on
> > > > valid names for datasets; for example, a dataset name like
> > > > 'TRMM_hourly_precip' would make it difficult to incorporate
> > > > subregions.
> > > >
> > > > Mike, my memory since our last meeting is a bit fuzzy so please
> > > > clarify or correct any of my points if I am wrong here. I would like
> > > > to hear other ideas or opinions as to the best approach for the
> > > > subregion problem.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > --
> > > > Alex Goodman
> > > >
> > >
> >
>



-- 
Alex Goodman
