Mike,

To each their own. Folks in the project are free to scratch their own itches. I don't anticipate having the time to implement this, but if Alex does, and if he proceeds down that path, I encourage him (you, anyone) not to rewrite a Quad Tree, which is where that line of discussion seemed to be headed. Quad Trees already exist and should be reused.
Chris

-----Original Message-----
From: Michael Joyce <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, July 30, 2013 2:03 PM
To: dev <[email protected]>
Subject: Re: OCW Refactoring and Subregions

>While an interesting idea this seems like unnecessary complexity. If you
>want to subregion a subregion make a new Evaluation. If we stick with the
>simple approach subregion information can remain outside of a Dataset and
>instead be passed to the Evaluation (which seems better). The eval object
>can then just subset and evaluate as appropriate and the subregion
>information need not be more complicated than a list of bounds. As in:
>
>(The current form which assumes bounding rectangles)
>[[latMin, lonMin, latMax, lonMax], [latMin, lonMin, latMax, lonMax], ...]
>
>(Possible future form which is more generic)
>[BoundsObject, BoundsObject, ....]
>
>Maybe I'm missing something though?
>
>
>-- Joyce
>
>
>On Tue, Jul 30, 2013 at 10:49 AM, Mattmann, Chris A (398J) <
>[email protected]> wrote:
>
>> Alex this sounds like a classic QuadTree:
>>
>> http://en.wikipedia.org/wiki/Quadtree
>>
>>
>> One of the fundamental spatial data structures. In Apache SIS we have
>> a Quad Tree implementation, but it's in Java:
>>
>> http://s.apache.org/MDi
>>
>>
>> Maybe we should implement in OCW, or see if we can use SIS's QuadTree
>> through JCC:
>>
>> http://lucene.apache.org/pylucene/jcc/
>>
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>> -----Original Message-----
>> From: <Goodman>, "Alexander (398J-Affiliate)" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, July 30, 2013 1:39 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: OCW Refactoring and Subregions
>>
>> >Honestly, if we just treat subregions as independent datasets without
>> >special attributes, things can get a little hairy. Right now subregions
>> >by themselves are defined as simple bounding boxes (or in the future more
>> >complex polygons). However these are not tied to one particular dataset,
>> >they are instead applied to multiple datasets. The proposed
>> >SubregionDataset, on the other hand, must be tied to a parent Dataset
>> >instance, because subsetted data isn't very useful without the original
>> >context. I think this depends on what is meant by "grouping" here, but if
>> >we for example had some function in dataset_processor that spits out the
>> >subsetted datasets and then pass them in to the evaluation later as
>> >generic datasets, how will the user know which dataset the subsetted data
>> >originally belonged to without messing around with the name attribute?
>> >
>> >If we decide that a Dataset subclass for subregions is irrelevant and keep
>> >one generic Dataset class, allow me to propose the following modification:
>> >From a computer science perspective I think it might be most convenient to
>> >think of subregion datasets as a tree data structure, with the original
>> >dataset as the root node. This type of implementation would make it
>> >possible to have several levels of complexity (eg subregions of
>> >subregions) but for the sake of argument let's just say that our tree
>> >always has a depth of one since I don't think any scientist would need
>> >more than that. This could be implemented by giving each Dataset a
>> >subregions attribute containing a dictionary of datasets keyed by
>> >subregion name, and of course the values are just new datasets (ideally
>> >making sure that dataset lats, lons, and values are views of their
>> >corresponding attributes in the root dataset and not deep copies). Then we
>> >could define our metrics so that they could work recursively on each
>> >dataset, first running on the entire thing, then repeating on each child
>> >(subregion). A potential disadvantage I might see from this is then
>> >figuring out the best way to structure the result of the metric. You also
>> >need to decide how to then add new subregions to the dataset, either
>> >directly to the dictionary and/or argument in the constructor or solely
>> >through dataset_processor.
>> >
>> >Alternatively it would still be fine to make a new subclass of Dataset and
>> >have the primary attributes be multidimensional arrays containing the
>> >subsetted data, but doing this might be less elegant since you would then
>> >need to process it differently in the evaluation.
>> >
>> >Any more thoughts? The tree idea was merely something I just recently came
>> >up with out of the blue, so I am not sure if I covered all the possible
>> >drawbacks.
>> >
>> >Thanks,
>> >Alex
>> >
>> >
>> >On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:
>> >
>> >> I think the most important thing that we need to decide on is:
>> >>
>> >> Do we treat subregions of a dataset as a single object? As in
>> >> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions, aDataset)
>> >>
>> >> In this case, what would aSubregionedDataset look like? Would it have a
>> >> list of lat lists? One list for each subregion that was taken from the
>> >> dataset?
>> >>
>> >> or is a subregion effectively just a dataset? As in
>> >>
>> >> >>> [firstDataset, secondDataset, ..., nthDataset] = DatasetProcessor.subregion(someNSubregions, aDataset)
>> >>
>> >> If we go with the second approach, I don't see the point in making a
>> >> distinction. If a subregion is really just a subset of a dataset then
>> >> there's no purpose in separating the two. The user should be responsible
>> >> for properly grouping datasets (that happen to be subregions) into the
>> >> Evaluation and passing them in the expected grouping for plotting. In
>> >> this case, the Evaluation object doesn't treat a subregion differently at
>> >> all. It's just a Dataset that gets run through everything like normal.
>> >>
>> >> If the only purpose for making a special distinction between a subregion
>> >> and a dataset is for grouping convenience then we really need to ask
>> >> ourselves if the user should be responsible for handling the grouping so
>> >> we can simplify the system. Personally, I think the user should be
>> >> responsible for this work. However, that's only because as far as I can
>> >> tell a subregion is just a Dataset with an adjusted bounding box. Perhaps
>> >> I'm oversimplifying.
>> >>
>> >>
>> >> -- Joyce
>> >>
>> >>
>> >> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]> wrote:
>> >>
>> >> > I think option 3 is the best given the rationale that has been stated
>> >> > previously.
>> >> >
>> >> > I can add a function to the dataset_processor module that will take in a
>> >> > single Dataset Object and a list of SubRegion Specifications (north,
>> >> > south, east, west, Name), and it could return a tuple of SubRegion
>> >> > objects with a length equal to the number of SubRegion Specs.
>> >> >
>> >> > SubClassing Dataset makes sense because a Dataset and SubRegion share
>> >> > common attributes, but after talking with Mike about the two, can a
>> >> > future science user please give me a clear difference between a Dataset
>> >> > and a SubRegion?
>> >> >
>> >> > I hope a SubRegion assumes specific Metrics to be run, that cannot be run
>> >> > on a Dataset. I fear if SubRegion and Dataset are too similar it will
>> >> > merely confuse users (and software engineers) about when to use which
>> >> > one.
>> >> >
>> >> > Can anyone articulate the difference between a Dataset and SubRegion for
>> >> > me?
>> >> >
>> >> >
>> >> > Thanks,
>> >> >
>> >> >
>> >> > Cameron
>> >> >
>> >> >
>> >> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]> wrote:
>> >> >
>> >> > > You covered most everything Alex.
>> >> > >
>> >> > > I'm a fan of inheriting from Dataset to handle Subregions. The user can
>> >> > > still add the "dataset" the same way to an Evaluation. Then the
>> >> > > Evaluation instance can run a separate eval loop to handle subregions.
>> >> > > It makes Evaluation more complicated but using naming convention to
>> >> > > designate a subregion will just be worse I feel.
>> >> > > The DatasetProcessor could have a function that takes a Dataset and
>> >> > > subregion information and spits out a new SubregionDataset (or some
>> >> > > such meaningful name) instance that the user can add to the Evaluation.
>> >> > >
>> >> > > What does everyone think would be a good way of handling this?
>> >> > >
>> >> > >
>> >> > > -- Joyce
>> >> > >
>> >> > >
>> >> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander (398J-Affiliate) <
>> >> > > [email protected]> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > Being able to account for subregions will be a crucial part of running
>> >> > > > an evaluation and making the right plots as part of our OCW
>> >> > > > refactoring. Mike and I had a discussion last Friday on some ways to do
>> >> > > > this and we both thought that the best approach would make use of the
>> >> > > > Dataset class somehow. Some specific ideas we had include:
>> >> > > >
>> >> > > > 1) Designate datasets as subregional by convention. Specifically, this
>> >> > > > could be something like making a new dataset instance with the same
>> >> > > > name as the parent dataset but with the subregion name appended to the
>> >> > > > end with a leading underscore (eg name_R01, name_R02).
>> >> > > >
>> >> > > > 2) Values for a particular subregion could be placed in a list or
>> >> > > > dictionary as an attribute of Dataset.
>> >> > > >
>> >> > > > 3) Make a subclass of Dataset explicitly for subregions.
>> >> > > >
>> >> > > > In general, any approach will add an additional complication to some
>> >> > > > component of the new OCW code in that the evaluation results / datasets
>> >> > > > need to get grouped together by subregion.
>> >> > > >
>> >> > > > My preferred approach is (3) since it adds the least amount of
>> >> > > > complication to the plotting.
>> >> > > > I particularly don't like (1) since enforcing a rule by convention
>> >> > > > would add restrictions to users on valid names for datasets, for
>> >> > > > example a dataset name like 'TRMM_hourly_precip' would make it
>> >> > > > difficult to incorporate subregions.
>> >> > > >
>> >> > > > Mike, my memory since our last meeting is a bit fuzzy so please clarify
>> >> > > > or correct any of my points if I am wrong here. I would like to hear
>> >> > > > other ideas or opinions as to the best approach for the subregion
>> >> > > > problem.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Alex
>> >> > > >
>> >> > > > --
>> >> > > > Alex Goodman
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>> >
>> >--
>> >Alex Goodman
>>
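
For reference, the list-of-bounds form from Mike's reply near the top of this thread ([[latMin, lonMin, latMax, lonMax], ...] passed to the evaluation rather than stored on the Dataset) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Dataset class is a stand-in and not the OCW class, and subset, mean, and evaluate are hypothetical helper names, not existing dataset_processor or Evaluation APIs. It uses plain Python lists, so the subsets are copies rather than views.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Dataset:
    """Hypothetical stand-in for an OCW Dataset: 1-D lats/lons, 2-D values."""
    name: str
    lats: List[float]
    lons: List[float]
    values: List[List[float]]  # values[i][j] belongs to (lats[i], lons[j])


def subset(dataset: Dataset, bounds: Sequence[float]) -> Dataset:
    """Return the part of `dataset` inside [latMin, lonMin, latMax, lonMax]."""
    lat_min, lon_min, lat_max, lon_max = bounds
    rows = [i for i, lat in enumerate(dataset.lats) if lat_min <= lat <= lat_max]
    cols = [j for j, lon in enumerate(dataset.lons) if lon_min <= lon <= lon_max]
    return Dataset(
        name=dataset.name,  # the parent name is kept so results stay traceable
        lats=[dataset.lats[i] for i in rows],
        lons=[dataset.lons[j] for j in cols],
        values=[[dataset.values[i][j] for j in cols] for i in rows],
    )


def mean(dataset: Dataset) -> float:
    """Toy metric: unweighted mean of all values (fails on an empty subset)."""
    flat = [v for row in dataset.values for v in row]
    return sum(flat) / len(flat)


def evaluate(
    dataset: Dataset,
    subregions: Sequence[Sequence[float]],
    metric: Callable[[Dataset], float],
) -> Dict[str, object]:
    """Run `metric` on the full dataset, then on each subregion in order."""
    return {
        "full": metric(dataset),
        "subregions": [metric(subset(dataset, b)) for b in subregions],
    }


if __name__ == "__main__":
    ds = Dataset(
        name="example",
        lats=[0.0, 1.0, 2.0],
        lons=[10.0, 11.0, 12.0],
        values=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    )
    bounds_list = [[0.0, 10.0, 1.0, 11.0], [1.0, 11.0, 2.0, 12.0]]
    print(evaluate(ds, bounds_list, mean))
```

In this arrangement the Dataset knows nothing about subregions. The results come back in the same order as the bounds list, which is one possible answer to the grouping question raised earlier in the thread, though not necessarily the one the project would settle on.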
