While an interesting idea, this seems like unnecessary complexity. If you want to subregion a subregion, make a new Evaluation. If we stick with the simple approach, subregion information can remain outside of a Dataset and instead be passed to the Evaluation (which seems better). The eval object can then just subset and evaluate as appropriate, and the subregion information need not be more complicated than a list of bounds. As in:

(The current form, which assumes bounding rectangles)
[[latMin, lonMin, latMax, lonMax], [latMin, lonMin, latMax, lonMax], ...]

(Possible future form, which is more generic)
[BoundsObject, BoundsObject, ...]

Maybe I'm missing something though?

-- Joyce

On Tue, Jul 30, 2013 at 10:49 AM, Mattmann, Chris A (398J) <[email protected]> wrote:

> Alex this sounds like a classic QuadTree:
>
> http://en.wikipedia.org/wiki/Quadtree
>
> One of the fundamental spatial data structures. In Apache SIS we have
> a Quad Tree implementation, but it's in Java:
>
> http://s.apache.org/MDi
>
> Maybe we should implement in OCW, or see if we can use SIS's QuadTree
> through JCC:
>
> http://lucene.apache.org/pylucene/jcc/
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: <Goodman>, "Alexander (398J-Affiliate)" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, July 30, 2013 1:39 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: OCW Refactoring and Subregions
>
> >Honestly, if we just treat subregions as independent datasets without
> >special attributes, things can get a little hairy. Right now subregions by
> >themselves are defined as simple bounding boxes (or in the future more
> >complex polygons). However these are not tied to one particular dataset,
> >they are instead applied to multiple datasets. The proposed
> >SubregionDataset, on the other hand, must be tied to a parent Dataset
> >instance, because subsetted data isn't very useful without the original
> >context. I think this depends on what is meant by "grouping" here, but if
> >we for example had some function in dataset_processor that spits out the
> >subsetted datasets and then pass them in to the evaluation later as
> >generic datasets, how will the user know which dataset the subsetted data
> >originally belonged to without messing around with the name attribute?
> >
> >If we decide that a Dataset subclass for subregions is irrelevant and keep
> >one generic Dataset class, allow me to propose the following modification:
> >From a computer science perspective I think it might be most convenient to
> >think of subregion datasets as a tree data structure, with the original
> >dataset as the root node. This type of implementation would make it
> >possible to have several levels of complexity (eg subregions of
> >subregions) but for the sake of argument let's just say that our tree
> >always has a depth of one since I don't think any scientist would need
> >more than that. This could be implemented by giving each Dataset a
> >subregions attribute containing a dictionary of datasets keyed by
> >subregion name, and of course the values are just new datasets (ideally
> >making sure that dataset lats, lons, and values are views of their
> >corresponding attributes in the root dataset and not deep copies). Then we
> >could define our metrics so that they could work recursively on each
> >dataset, first running on the entire thing, then repeating on each child
> >(subregion). A potential disadvantage I might see from this is then
> >figuring out the best way to structure the result of the metric. You also
> >need to decide how to then add new subregions to the dataset, either
> >directly to the dictionary and/or argument in the constructor or solely
> >through dataset_processor.
> >
> >Alternatively it would still be fine to make a new subclass of Dataset and
> >have the primary attributes be multidimensional arrays containing the
> >subsetted data, but doing this might be less elegant since you would then
> >need to process it differently in the evaluation.
> >
> >Any more thoughts? The tree idea was merely something I just recently came
> >up with out of the blue, so I am not sure if I covered all the possible
> >drawbacks.
> >
> >Thanks,
> >Alex
> >
> >
> >On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:
> >
> >> I think the most important thing that we need to decide on is:
> >>
> >> Do we treat subregions of a dataset as a single object. As in
> >>
> >> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions, aDataset)
> >>
> >> In this case, what would aSubregionedDataset look like? Would it have a
> >> list of lat lists? One list for each subregion that was taken from the
> >> dataset?
> >>
> >> or is a subregion effectively just a dataset? As in
> >>
> >> >>> [firstDataset, secondDataset, ..., nthDataset] = DatasetProcessor.subregion(someNSubregions, aDataset)
> >>
> >> If we go with the second approach, I don't see the point in making a
> >> distinction. If a subregion is really just a subset of a dataset then
> >> there's no purpose in separating the two. The user should be responsible
> >> for properly grouping datasets (that happen to be subregions) into the
> >> Evaluation and passing them in the expected grouping for plotting. In
> >> this case, the Evaluation object doesn't treat a subregion differently
> >> at all. It's just a Dataset that gets run through everything like
> >> normal.
> >>
> >> If the only purpose for making a special distinction between a subregion
> >> and a dataset is for grouping convenience then we really need to ask
> >> ourselves if the user should be responsible for handling the grouping so
> >> we can simplify the system. Personally, I think the user should be
> >> responsible for this work. However, that's only because as far as I can
> >> tell a subregion is just a Dataset with an adjusted bounding box.
> >> Perhaps I'm oversimplifying.
> >>
> >>
> >> -- Joyce
> >>
> >>
> >> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]> wrote:
> >>
> >> > I think option 3 is the best given the rationale that has been stated
> >> > previously.
> >> >
> >> > I can add a function to the dataset_processor module that will take in
> >> > a single Dataset Object and a list of SubRegion Specifications (north,
> >> > south, east, west, Name), and it could return a tuple of SubRegion
> >> > objects with a length equal to the number of SubRegion Specs.
> >> >
> >> > SubClassing Dataset makes sense because a Dataset and SubRegion share
> >> > common attributes, but after talking with Mike about the two, can a
> >> > future science user please give me a clear difference between a
> >> > Dataset and a SubRegion?
> >> >
> >> > I hope a SubRegion assumes specific Metrics to be run, that cannot be
> >> > run on a Dataset. I fear if SubRegion and Dataset are too similar it
> >> > will merely confuse users (and software engineers) about when to use
> >> > which one.
> >> >
> >> > Can anyone articulate the difference between a Dataset and SubRegion
> >> > for me?
> >> >
> >> >
> >> > Thanks,
> >> >
> >> >
> >> > Cameron
> >> >
> >> >
> >> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]> wrote:
> >> >
> >> > > You covered most everything Alex.
> >> > >
> >> > > I'm a fan of inheriting from Dataset to handle Subregions. The user
> >> > > can still add the "dataset" the same way to an Evaluation. Then the
> >> > > Evaluation instance can run a separate eval loop to handle
> >> > > subregions. It makes Evaluation more complicated but using naming
> >> > > convention to designate a subregion will just be worse I feel. The
> >> > > DatasetProcessor could have a function that takes a Dataset and
> >> > > subregion information and spits out a new SubregionDataset (or some
> >> > > such meaningful name) instance that the user can add to the
> >> > > Evaluation.
> >> > >
> >> > > What does everyone think would be a good way of handling this?
> >> > >
> >> > >
> >> > > -- Joyce
> >> > >
> >> > >
> >> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander (398J-Affiliate) <[email protected]> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Being able to account for subregions will be a crucial part of
> >> > > > running an evaluation and making the right plots as part of our
> >> > > > OCW refactoring. Mike and I had a discussion last Friday on some
> >> > > > ways to do this and we both thought that the best approach would
> >> > > > make use of the Dataset class somehow. Some specific ideas we had
> >> > > > include:
> >> > > >
> >> > > > 1) Designate datasets as subregional by convention. Specifically,
> >> > > > this could be something like making a new dataset instance with
> >> > > > the same name as the parent dataset but with the subregion name
> >> > > > appended to the end with a leading underscore (eg name_R01,
> >> > > > name_R02).
> >> > > >
> >> > > > 2) Values for a particular subregion could be placed in a list or
> >> > > > dictionary as an attribute of Dataset.
> >> > > >
> >> > > > 3) Make a subclass of Dataset explicitly for subregions.
> >> > > >
> >> > > > In general, any approach will add an additional complication to
> >> > > > some component of the new OCW code in that the evaluation results /
> >> > > > datasets need to get grouped together by subregion.
> >> > > >
> >> > > > My preferred approach is (3) since it adds the least amount of
> >> > > > complication to the plotting. I particularly don't like (1) since
> >> > > > enforcing a rule by convention would add restrictions to users on
> >> > > > valid names for datasets, for example a dataset name like
> >> > > > 'TRMM_hourly_precip' would make it difficult to incorporate
> >> > > > subregions.
> >> > > >
> >> > > > Mike, my memory since our last meeting is a bit fuzzy so please
> >> > > > clarify or correct any of my points if I am wrong here. I would
> >> > > > like to hear other ideas or opinions as to the best approach for
> >> > > > the subregion problem.
> >> > > >
> >> > > > Thanks,
> >> > > > Alex
> >> > > >
> >> > > > --
> >> > > > Alex Goodman
> >
> >--
> >Alex Goodman
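
A rough sketch of the simple approach described at the top of this message, in case it helps make it concrete. Everything here is made up for illustration and is not the actual OCW API: the `Dataset` fields, the `subset` helper, the `Evaluation` constructor and `run` method, and the toy `area_mean` metric are all assumptions. The only idea taken from the thread is that the Evaluation holds a plain list of `[latMin, lonMin, latMax, lonMax]` bounds and does the subsetting itself, so the Dataset never needs to know about subregions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Dataset:
    """Hypothetical stand-in for an OCW Dataset: 1-D lats/lons, 2-D values."""
    name: str
    lats: list
    lons: list
    values: list  # values[i][j] belongs to (lats[i], lons[j])


def subset(dataset, bounds):
    """Return a new Dataset restricted to [lat_min, lon_min, lat_max, lon_max]."""
    lat_min, lon_min, lat_max, lon_max = bounds
    rows = [i for i, lat in enumerate(dataset.lats) if lat_min <= lat <= lat_max]
    cols = [j for j, lon in enumerate(dataset.lons) if lon_min <= lon <= lon_max]
    return Dataset(
        name=dataset.name,
        lats=[dataset.lats[i] for i in rows],
        lons=[dataset.lons[j] for j in cols],
        values=[[dataset.values[i][j] for j in cols] for i in rows],
    )


class Evaluation:
    """Hypothetical Evaluation that owns the subregion bounds itself."""

    def __init__(self, datasets, metric, subregions=()):
        self.datasets = list(datasets)
        self.metric = metric
        self.subregions = [list(b) for b in subregions]

    def run(self):
        """Map (dataset name, region index) to the metric result.

        A region index of None means the full, unsubsetted dataset.
        """
        results = {}
        for ds in self.datasets:
            results[(ds.name, None)] = self.metric(ds)
            for k, bounds in enumerate(self.subregions):
                results[(ds.name, k)] = self.metric(subset(ds, bounds))
        return results


def area_mean(dataset):
    """Toy metric: unweighted mean over every grid cell."""
    return mean(v for row in dataset.values for v in row)


ds = Dataset("demo", lats=[0, 10, 20], lons=[0, 10, 20],
             values=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
evaluation = Evaluation([ds], area_mean, subregions=[[0, 0, 10, 10]])
results = evaluation.run()
```

Keying the results by (dataset name, region index) is one possible answer to the "which dataset did this subset come from" question without touching the name attribute. Note that this sketch copies the subsetted values into new lists; with NumPy arrays, slicing could return views of the parent data instead, which is what Alex suggested.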
