Re: OCW Refactoring and Subregions

Mattmann, Chris A (398J) Tue, 30 Jul 2013 10:51:10 -0700

Alex, this sounds like a classic QuadTree:

http://en.wikipedia.org/wiki/Quadtree



It's one of the fundamental spatial data structures. In Apache SIS we have
a QuadTree implementation, but it's in Java:

http://s.apache.org/MDi


Maybe we should implement one in OCW, or see if we can use SIS's QuadTree
through JCC:

http://lucene.apache.org/pylucene/jcc/
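
For anyone who hasn't run into one, below is a rough sketch of what a small
pure-Python point quadtree could look like. It is only an illustration of the
data structure: the class name, the capacity and max_depth parameters, and the
(west, south, east, north) argument order are all assumptions made up for this
example. It is not the Apache SIS API and not existing OCW code.

```python
class QuadTree:
    """Illustrative point quadtree over a lon/lat box (hypothetical API)."""

    def __init__(self, west, south, east, north, capacity=4, max_depth=8):
        self.bounds = (west, south, east, north)
        self.capacity = capacity    # points a node holds before it splits
        self.max_depth = max_depth  # guard against endless splitting
        self.points = []            # (lon, lat, payload) tuples at this node
        self.children = None        # four sub-quadrants once split

    def _contains(self, lon, lat):
        west, south, east, north = self.bounds
        return west <= lon <= east and south <= lat <= north

    def insert(self, lon, lat, payload=None):
        """Store a point; return False if it lies outside this node."""
        if not self._contains(lon, lat):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.max_depth == 0:
                self.points.append((lon, lat, payload))
                return True
            self._split()
        # any() stops at the first child that accepts the point, so a point
        # on a shared edge is stored exactly once.
        return any(c.insert(lon, lat, payload) for c in self.children)

    def _split(self):
        west, south, east, north = self.bounds
        mid_lon, mid_lat = (west + east) / 2.0, (south + north) / 2.0
        args = (self.capacity, self.max_depth - 1)
        self.children = [
            QuadTree(west, south, mid_lon, mid_lat, *args),
            QuadTree(mid_lon, south, east, mid_lat, *args),
            QuadTree(west, mid_lat, mid_lon, north, *args),
            QuadTree(mid_lon, mid_lat, east, north, *args),
        ]
        # Push the points held here down into the new children.
        old, self.points = self.points, []
        for lon, lat, payload in old:
            self.insert(lon, lat, payload)

    def query(self, west, south, east, north):
        """Return every stored point inside the given bounding box."""
        n_west, n_south, n_east, n_north = self.bounds
        if east < n_west or west > n_east or north < n_south or south > n_north:
            return []  # query box does not overlap this node
        found = [p for p in self.points
                 if west <= p[0] <= east and south <= p[1] <= north]
        for child in self.children or []:
            found.extend(child.query(west, south, east, north))
        return found


tree = QuadTree(-180, -90, 180, 90, capacity=2)
for lon, lat, label in [(10, 10, "a"), (20, 20, "b"), (-50, 30, "c"),
                        (100, -40, "d"), (15, 12, "e")]:
    tree.insert(lon, lat, label)

hits = tree.query(0, 0, 30, 30)
print(sorted(p[2] for p in hits))  # ['a', 'b', 'e']
```

A subregion bounding box then becomes a single query() call, and whole
quadrants that don't overlap the box are skipped without looking at their
points.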


Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Goodman>, "Alexander   (398J-Affiliate)" <[email protected]>
Reply-To: "[email protected]"
<[email protected]>
Date: Tuesday, July 30, 2013 1:39 PM
To: "[email protected]" <[email protected]>
Subject: Re: OCW Refactoring and Subregions

>Honestly, if we just treat subregions as independent datasets without
>special attributes, things can get a little hairy. Right now subregions by
>themselves are defined as simple bounding boxes (or in the future more
>complex polygons). However these are not tied to one particular dataset,
>they are instead applied to multiple datasets. The proposed
>SubregionDataset, on the other hand, must be tied to a parent Dataset
>instance, because subsetted data isn't very useful without the original
>context. I think this depends on what is meant by "grouping" here, but if
>we had, for example, some function in dataset_processor that spits out the
>subsetted datasets, which are then passed in to the evaluation later as
>generic datasets, how would the user know which dataset the subsetted data
>originally belonged to without messing around with the name attribute?
>
>If we decide that a Dataset subclass for subregions is irrelevant and keep
>one generic Dataset class, allow me to propose the following modification.
>From a computer science perspective, I think it might be most convenient to
>think of subregion datasets as a tree data structure, with the original
>dataset as the root node. This type of implementation would make it
>possible to have several levels of nesting (e.g. subregions of
>subregions), but for the sake of argument let's just say that our tree
>always has a depth of one, since I don't think any scientist would need
>more than that. This could be implemented by giving each Dataset a
>subregions attribute containing a dictionary keyed by subregion name, whose
>values are just new datasets (ideally making sure that each child's lats,
>lons, and values are views of the corresponding attributes in the root
>dataset and not deep copies). Then we could define our metrics so that they
>work recursively on each dataset, first running on the entire thing, then
>repeating on each child (subregion). A potential disadvantage I see is
>figuring out the best way to structure the result of the metric. We would
>also need to decide how new subregions get added to the dataset: directly
>to the dictionary, as an argument in the constructor, or solely through
>dataset_processor.
>
>Alternatively it would still be fine to make a new subclass of Dataset and
>have the primary attributes be multidimensional arrays containing the
>subsetted data, but doing this might be less elegant since you would then
>need to process it differently in the evaluation.
>
>Any more thoughts? The tree idea was merely something I just recently came
>up with out of the blue, so I am not sure if I covered all the possible
>drawbacks.
>
>Thanks,
>Alex
>
>
>On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:
>
>> I think the most important thing that we need to decide on is:
>>
>> Do we treat subregions of a dataset as a single object? As in
>>
>> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions, aDataset)
>>
>> In this case, what would aSubregionedDataset look like? Would it have a
>> list of lat lists? One list for each subregion that was taken from the
>> dataset?
>>
>> Or is a subregion effectively just a dataset? As in
>>
>> >>> [firstDataset, secondDataset, ..., nthDataset] = DatasetProcessor.subregion(
>> ...     someNSubregions, aDataset)
>>
>> If we go with the second approach, I don't see the point in making a
>> distinction. If a subregion is really just a subset of a dataset then
>> there's no purpose in separating the two. The user should be responsible
>> for properly grouping datasets (that happen to be subregions) into the
>> Evaluation and passing them in the expected grouping for plotting. In
>> this case, the Evaluation object doesn't treat a subregion differently at
>> all. It's just a Dataset that gets run through everything like normal.
>>
>> If the only purpose for making a special distinction between a subregion
>> and a dataset is for grouping convenience then we really need to ask
>> ourselves if the user should be responsible for handling the grouping so
>> we can simplify the system. Personally, I think the user should be
>> responsible for this work. However, that's only because as far as I can
>> tell a subregion is just a Dataset with an adjusted bounding box. Perhaps
>> I'm oversimplifying.
>>
>>
>> -- Joyce
>>
>>
>> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]> wrote:
>>
>> > I think option 3 is the best given the rationale that has been stated
>> > previously.
>> >
>> > I can add a function to the dataset_processor module that will take in
>> > a single Dataset Object and a list of SubRegion Specifications (north,
>> > south, east, west, Name), and it could return a tuple of SubRegion
>> > objects with a length equal to the number of SubRegion Specs.
>> >
>> > SubClassing Dataset makes sense because a Dataset and SubRegion share
>> > common attributes, but after talking with Mike about the two, can a
>> > future science user please give me a clear difference between a Dataset
>> > and a SubRegion?
>> >
>> > I hope a SubRegion assumes specific Metrics to be run that cannot be
>> > run on a Dataset. I fear that if SubRegion and Dataset are too similar
>> > it will merely confuse users (and software engineers) about when to use
>> > which one.
>> >
>> > Can anyone articulate the difference between a Dataset and SubRegion
>> > for me?
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Cameron
>> >
>> >
>> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]> wrote:
>> >
>> > > You covered most everything Alex.
>> > >
>> > > I'm a fan of inheriting from Dataset to handle Subregions. The user
>> > > can still add the "dataset" the same way to an Evaluation. Then the
>> > > Evaluation instance can run a separate eval loop to handle
>> > > subregions. It makes Evaluation more complicated, but using a naming
>> > > convention to designate a subregion will just be worse, I feel. The
>> > > DatasetProcessor could have a function that takes a Dataset and
>> > > subregion information and spits out a new SubregionDataset (or some
>> > > such meaningful name) instance that the user can add to the
>> > > Evaluation.
>> > >
>> > > What does everyone think would be a good way of handling this?
>> > >
>> > >
>> > > -- Joyce
>> > >
>> > >
>> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander (398J-Affiliate)
>> > > <[email protected]> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Being able to account for subregions will be a crucial part of
>> > > > running an evaluation and making the right plots as part of our OCW
>> > > > refactoring. Mike and I had a discussion last Friday on some ways
>> > > > to do this and we both thought that the best approach would make
>> > > > use of the Dataset class somehow. Some specific ideas we had
>> > > > include:
>> > > >
>> > > > 1) Designate datasets as subregional by convention. Specifically,
>> > > > this could be something like making a new dataset instance with the
>> > > > same name as the parent dataset but with the subregion name
>> > > > appended to the end with a leading underscore (e.g. name_R01,
>> > > > name_R02).
>> > > >
>> > > > 2) Values for a particular subregion could be placed in a list or
>> > > > dictionary as an attribute of Dataset.
>> > > >
>> > > > 3) Make a subclass of Dataset explicitly for subregions.
>> > > >
>> > > > In general, any approach will add an additional complication to
>> > > > some component of the new OCW code in that the evaluation results /
>> > > > datasets need to get grouped together by subregion.
>> > > >
>> > > > My preferred approach is (3) since it adds the least amount of
>> > > > complication to the plotting. I particularly don't like (1) since
>> > > > enforcing a rule by convention would add restrictions to users on
>> > > > valid names for datasets. For example, a dataset name like
>> > > > 'TRMM_hourly_precip', which already contains underscores, would
>> > > > make it difficult to incorporate subregions.
>> > > >
>> > > > Mike, my memory since our last meeting is a bit fuzzy so please
>> > > > clarify or correct any of my points if I am wrong here. I would
>> > > > like to hear other ideas or opinions as to the best approach for
>> > > > the subregion problem.
>> > > >
>> > > > Thanks,
>> > > > Alex
>> > > >
>> > > > --
>> > > > Alex Goodman
>> > > >
>> > >
>> >
>>
>
>
>
>-- 
>Alex Goodman
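
The tree idea from Alex's message above (each Dataset carries a subregions
dictionary, and metrics recurse from the root into each child) could be
sketched roughly as follows. Everything here is a stand-in written for
illustration: the Dataset constructor, add_subregion, run_metric and
spatial_mean are hypothetical names, not the real OCW classes or functions.
To stay self-contained the sketch uses plain lists, so the child data is
copied; with NumPy arrays, basic slicing of the parent's arrays would give
the views the message asks for.

```python
class Dataset:
    """Stand-in for a generic Dataset; attribute names are assumptions."""

    def __init__(self, name, lats, lons, values, parent=None):
        self.name = name
        self.lats = lats        # 1-D list of latitudes
        self.lons = lons        # 1-D list of longitudes
        self.values = values    # values[i][j] belongs to lats[i], lons[j]
        self.parent = parent    # None for the root dataset
        self.subregions = {}    # subregion name -> child Dataset

    def add_subregion(self, name, south, north, west, east):
        """Subset by bounding box and attach the result as a child."""
        rows = [i for i, lat in enumerate(self.lats) if south <= lat <= north]
        cols = [j for j, lon in enumerate(self.lons) if west <= lon <= east]
        child = Dataset(
            name,
            [self.lats[i] for i in rows],
            [self.lons[j] for j in cols],
            [[self.values[i][j] for j in cols] for i in rows],
            parent=self,
        )
        self.subregions[name] = child
        return child


def run_metric(metric, dataset):
    """Run metric on the dataset, then recursively on each subregion."""
    return {
        "name": dataset.name,
        "result": metric(dataset),
        "subregions": {name: run_metric(metric, child)
                       for name, child in dataset.subregions.items()},
    }


def spatial_mean(dataset):
    """Toy metric: unweighted mean over all grid cells (must be non-empty)."""
    cells = [v for row in dataset.values for v in row]
    return sum(cells) / len(cells)


root = Dataset(
    "TRMM_hourly_precip",
    lats=[0, 10, 20],
    lons=[0, 10, 20, 30],
    values=[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
)
root.add_subregion("R01", south=0, north=10, west=0, east=10)

results = run_metric(spatial_mean, root)
print(results["result"])                      # 6.5
print(results["subregions"]["R01"]["result"])  # 3.5
```

Because each child keeps a parent reference and lives under its own key, the
subregion name never has to be encoded in the dataset name, and the nested
result dictionary is one possible answer to the question of how to structure
metric output.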
