Alex, this sounds like a classic quadtree:

http://en.wikipedia.org/wiki/Quadtree

It is one of the fundamental spatial data structures. In Apache SIS we have
a quadtree implementation, but it's in Java:

http://s.apache.org/MDi

Maybe we should implement one in OCW, or see if we can use SIS's QuadTree
through JCC:

http://lucene.apache.org/pylucene/jcc/

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Goodman>, "Alexander (398J-Affiliate)" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, July 30, 2013 1:39 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCW Refactoring and Subregions

>Honestly, if we just treat subregions as independent datasets without
>special attributes, things can get a little hairy. Right now subregions by
>themselves are defined as simple bounding boxes (or in the future more
>complex polygons). However, these are not tied to one particular dataset;
>they are instead applied to multiple datasets. The proposed
>SubregionDataset, on the other hand, must be tied to a parent Dataset
>instance, because subsetted data isn't very useful without the original
>context. I think this depends on what is meant by "grouping" here, but if
>we, for example, had some function in dataset_processor that spits out the
>subsetted datasets, and we then pass them in to the evaluation later as
>generic datasets, how will the user know which dataset the subsetted data
>originally belonged to without messing around with the name attribute?
>
>If we decide that a Dataset subclass for subregions is irrelevant and keep
>one generic Dataset class, allow me to propose the following modification.
>
>From a computer science perspective, I think it might be most convenient
>to think of subregion datasets as a tree data structure, with the original
>dataset as the root node. This type of implementation would make it
>possible to have several levels of complexity (e.g. subregions of
>subregions), but for the sake of argument let's just say that our tree
>always has a depth of one, since I don't think any scientist would need
>more than that. This could be implemented by giving each Dataset a
>subregions attribute containing a dictionary of datasets keyed by
>subregion name, where the values are just new datasets (ideally making
>sure that dataset lats, lons, and values are views of their corresponding
>attributes in the root dataset and not deep copies). Then we could define
>our metrics so that they work recursively on each dataset, first running
>on the entire thing, then repeating on each child (subregion). A potential
>disadvantage I see is figuring out the best way to structure the result of
>the metric. You also need to decide how to add new subregions to the
>dataset: directly to the dictionary, as an argument to the constructor,
>or solely through dataset_processor.
>
>Alternatively, it would still be fine to make a new subclass of Dataset
>and have the primary attributes be multidimensional arrays containing the
>subsetted data, but doing this might be less elegant since you would then
>need to process it differently in the evaluation.
>
>Any more thoughts? The tree idea was merely something I just recently came
>up with out of the blue, so I am not sure if I covered all the possible
>drawbacks.
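
The tree idea above can be sketched roughly as follows. This is only an
illustration, not OCW code: the Dataset class here is a hypothetical
stand-in, and add_subregion, area_mean, and run_metric are made-up names.
It uses plain Python lists, so each child holds a copy of its rows; with
NumPy arrays, basic slicing of the root's arrays would give the views
mentioned above. The nested result dictionary is just one possible answer
to the "how to structure the result" question.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Dataset:
    """Hypothetical stand-in for the OCW Dataset class (not the real API)."""
    name: str
    lats: list     # 1-D list of latitudes
    lons: list     # 1-D list of longitudes
    values: list   # values[i][j] pairs with (lats[i], lons[j])
    subregions: dict = field(default_factory=dict)  # subregion name -> Dataset

    def add_subregion(self, name, south, north, west, east):
        """Attach a child Dataset covering the given bounding box."""
        rows = [i for i, lat in enumerate(self.lats) if south <= lat <= north]
        cols = [j for j, lon in enumerate(self.lons) if west <= lon <= east]
        child = Dataset(
            name=name,
            lats=[self.lats[i] for i in rows],
            lons=[self.lons[j] for j in cols],
            # Plain lists are copied here; NumPy slices would be views.
            values=[[self.values[i][j] for j in cols] for i in rows],
        )
        self.subregions[name] = child
        return child


def area_mean(dataset):
    """Toy metric: unweighted mean of every grid value."""
    return mean(v for row in dataset.values for v in row)


def run_metric(metric, dataset):
    """Apply the metric to the dataset, then recurse into each subregion."""
    return {
        "result": metric(dataset),
        "subregions": {
            name: run_metric(metric, child)
            for name, child in dataset.subregions.items()
        },
    }


root = Dataset(
    name="example",
    lats=[0.0, 1.0, 2.0],
    lons=[10.0, 11.0, 12.0],
    values=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
)
root.add_subregion("R01", south=0.0, north=1.0, west=10.0, east=11.0)
results = run_metric(area_mean, root)
print(results)
```

Because run_metric recurses through the subregions dictionary, the same
function would also handle subregions of subregions, although the
discussion above assumes a depth of one.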
>
>Thanks,
>Alex
>
>
>On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:
>
>> I think the most important thing that we need to decide on is:
>>
>> Do we treat subregions of a dataset as a single object? As in
>>
>> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions, aDataset)
>>
>> In this case, what would aSubregionedDataset look like? Would it have a
>> list of lat lists, one list for each subregion that was taken from the
>> dataset?
>>
>> Or is a subregion effectively just a dataset? As in
>>
>> >>> [firstDataset, secondDataset, ..., nthDataset] = DatasetProcessor.subregion(someNSubregions, aDataset)
>>
>> If we go with the second approach, I don't see the point in making a
>> distinction. If a subregion is really just a subset of a dataset then
>> there's no purpose in separating the two. The user should be responsible
>> for properly grouping datasets (that happen to be subregions) into the
>> Evaluation and passing them in the expected grouping for plotting. In
>> this case, the Evaluation object doesn't treat a subregion differently
>> at all. It's just a Dataset that gets run through everything like normal.
>>
>> If the only purpose for making a special distinction between a subregion
>> and a dataset is grouping convenience, then we really need to ask
>> ourselves if the user should be responsible for handling the grouping so
>> we can simplify the system. Personally, I think the user should be
>> responsible for this work. However, that's only because as far as I can
>> tell a subregion is just a Dataset with an adjusted bounding box.
>> Perhaps I'm oversimplifying.
>>
>>
>> -- Joyce
>>
>>
>> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]>
>> wrote:
>>
>> > I think option 3 is the best given the rationale that has been stated
>> > previously.
>> >
>> > I can add a function to the dataset_processor module that will take in
>> > a single Dataset object and a list of SubRegion specifications (north,
>> > south, east, west, name), and it could return a tuple of SubRegion
>> > objects with a length equal to the number of SubRegion specs.
>> >
>> > Subclassing Dataset makes sense because a Dataset and SubRegion share
>> > common attributes, but after talking with Mike about the two, can a
>> > future science user please give me a clear difference between a
>> > Dataset and a SubRegion?
>> >
>> > I hope a SubRegion assumes specific Metrics to be run that cannot be
>> > run on a Dataset. I fear if SubRegion and Dataset are too similar it
>> > will merely confuse users (and software engineers) about when to use
>> > which one.
>> >
>> > Can anyone articulate the difference between a Dataset and SubRegion
>> > for me?
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Cameron
>> >
>> >
>> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]>
>> > wrote:
>> >
>> > > You covered most everything, Alex.
>> > >
>> > > I'm a fan of inheriting from Dataset to handle Subregions. The user
>> > > can still add the "dataset" the same way to an Evaluation. Then the
>> > > Evaluation instance can run a separate eval loop to handle
>> > > subregions. It makes Evaluation more complicated, but using a naming
>> > > convention to designate a subregion will just be worse, I feel. The
>> > > DatasetProcessor could have a function that takes a Dataset and
>> > > subregion information and spits out a new SubregionDataset (or some
>> > > such meaningful name) instance that the user can add to the
>> > > Evaluation.
>> > >
>> > > What does everyone think would be a good way of handling this?
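
As a rough illustration of the dataset_processor function and the
SubregionDataset subclass described in the two messages above, something
like the following might work. Everything here is an assumption for the
sake of the sketch: the minimal Dataset class, the SubRegionSpec tuple, the
parent attribute, and the constructor signatures are all hypothetical. The
argument order of subregion() follows the DatasetProcessor.subregion
pseudocode earlier in the thread.

```python
from collections import namedtuple

# Field order follows the message above: (north, south, east, west, name).
SubRegionSpec = namedtuple(
    "SubRegionSpec", ["north", "south", "east", "west", "name"]
)


class Dataset:
    """Hypothetical minimal Dataset; the real OCW class may look different."""

    def __init__(self, name, lats, lons, values):
        self.name = name
        self.lats = lats      # 1-D list of latitudes
        self.lons = lons      # 1-D list of longitudes
        self.values = values  # values[i][j] pairs with (lats[i], lons[j])


class SubregionDataset(Dataset):
    """A Dataset that also records the parent it was cut from."""

    def __init__(self, name, lats, lons, values, parent):
        super().__init__(name, lats, lons, values)
        self.parent = parent


def subregion(specs, dataset):
    """Return a tuple with one SubregionDataset per spec, in spec order."""
    results = []
    for spec in specs:
        # Keep grid points on or inside the bounding box.
        rows = [i for i, lat in enumerate(dataset.lats)
                if spec.south <= lat <= spec.north]
        cols = [j for j, lon in enumerate(dataset.lons)
                if spec.west <= lon <= spec.east]
        results.append(SubregionDataset(
            name=spec.name,
            lats=[dataset.lats[i] for i in rows],
            lons=[dataset.lons[j] for j in cols],
            values=[[dataset.values[i][j] for j in cols] for i in rows],
            parent=dataset,
        ))
    return tuple(results)


full = Dataset("example", [0, 1, 2], [10, 11, 12],
               [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
specs = [SubRegionSpec(1, 0, 11, 10, "R01"),
         SubRegionSpec(2, 2, 12, 12, "R02")]
r01, r02 = subregion(specs, full)
print(r01.name, r01.values, r01.parent.name)
```

Keeping a reference to the parent is one way to tell which dataset a
subset came from without encoding that in the name attribute. Since
SubregionDataset is still a Dataset, an Evaluation could accept it the same
way and check isinstance() only where subregions need separate handling.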
>> > >
>> > >
>> > > -- Joyce
>> > >
>> > >
>> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander
>> > > (398J-Affiliate) <[email protected]> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Being able to account for subregions will be a crucial part of
>> > > > running an evaluation and making the right plots as part of our
>> > > > OCW refactoring. Mike and I had a discussion last Friday on some
>> > > > ways to do this, and we both thought that the best approach would
>> > > > make use of the Dataset class somehow. Some specific ideas we had
>> > > > include:
>> > > >
>> > > > 1) Designate datasets as subregional by convention. Specifically,
>> > > > this could be something like making a new dataset instance with
>> > > > the same name as the parent dataset but with the subregion name
>> > > > appended to the end with a leading underscore (e.g. name_R01,
>> > > > name_R02).
>> > > >
>> > > > 2) Values for a particular subregion could be placed in a list or
>> > > > dictionary as an attribute of Dataset.
>> > > >
>> > > > 3) Make a subclass of Dataset explicitly for subregions.
>> > > >
>> > > > In general, any approach will add a complication to some component
>> > > > of the new OCW code, in that the evaluation results / datasets
>> > > > need to get grouped together by subregion.
>> > > >
>> > > > My preferred approach is (3) since it adds the least amount of
>> > > > complication to the plotting. I particularly don't like (1), since
>> > > > enforcing a rule by convention would restrict the names users can
>> > > > give their datasets. For example, a dataset name like
>> > > > 'TRMM_hourly_precip' would make it difficult to incorporate
>> > > > subregions.
>> > > >
>> > > > Mike, my memory since our last meeting is a bit fuzzy, so please
>> > > > clarify or correct any of my points if I am wrong here. I would
>> > > > like to hear other ideas or opinions as to the best approach for
>> > > > the subregion problem.
>> > > >
>> > > > Thanks,
>> > > > Alex
>> > > >
>> > > > --
>> > > > Alex Goodman
>> > > >
>> > >
>> >
>>
>
>--
>Alex Goodman
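
To make the objection to option (1) concrete, here is a small sketch of
what parsing the naming convention might look like. The helper
split_by_convention is hypothetical; it assumes the subregion name is
whatever follows the last underscore.

```python
def split_by_convention(name):
    """Split 'parent_subregion' on the last underscore (hypothetical helper)."""
    parent, sep, region = name.rpartition("_")
    if not sep:
        return name, None  # no underscore: treat as a plain dataset
    return parent, region


print(split_by_convention("name_R01"))            # ('name', 'R01')
print(split_by_convention("TRMM_hourly_precip"))  # ('TRMM_hourly', 'precip')
```

The second call shows the problem: an ordinary dataset name that happens to
contain underscores is misread as parent 'TRMM_hourly' with subregion
'precip', so the convention only works if users avoid underscores or the
subregion suffix follows a stricter pattern.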
