Re: OCW Refactoring and Subregions

Mattmann, Chris A (398J) Wed, 31 Jul 2013 09:13:34 -0700

Mike,

To each their own. Folks in the project are free to scratch their
own itches. I don't anticipate having the time to implement this,
but if Alex does, and if he proceeds down that path, I encourage him
(you, anyone) not to rewrite a Quad Tree, which is where that line of
discussion seemed to be headed. Quad Trees already exist and should
be reused.


Chris



-----Original Message-----
From: Michael Joyce <[email protected]>
Reply-To: "[email protected]"
<[email protected]>
Date: Tuesday, July 30, 2013 2:03 PM
To: dev <[email protected]>
Subject: Re: OCW Refactoring and Subregions

>While an interesting idea, this seems like unnecessary complexity. If you
>want to subregion a subregion, make a new Evaluation. If we stick with the
>simple approach, subregion information can remain outside of a Dataset and
>instead be passed to the Evaluation (which seems better). The eval object
>can then just subset and evaluate as appropriate, and the subregion
>information need not be more complicated than a list of bounds. As in:
>
>(The current form which assumes bounding rectangles)
>[[latMin, lonMin, latMax, lonMax], [latMin, lonMin, latMax, lonMax], ...]
>
>(Possible future form which is more generic)
>[BoundsObject, BoundsObject, ....]
>
>Maybe I'm missing something though?
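
A minimal sketch of the bounds-list idea above, in plain Python. The names
Bounds, subset_points, and evaluate_subregions are hypothetical and are not
part of OCW, and the dataset is reduced to a list of (lat, lon, value)
tuples purely for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a "BoundsObject"; not an OCW class.
Bounds = namedtuple("Bounds", ["lat_min", "lon_min", "lat_max", "lon_max"])


def subset_points(points, bounds):
    """Return the (lat, lon, value) points that fall inside the bounding box."""
    return [
        (lat, lon, value)
        for lat, lon, value in points
        if bounds.lat_min <= lat <= bounds.lat_max
        and bounds.lon_min <= lon <= bounds.lon_max
    ]


def evaluate_subregions(points, subregions, metric):
    """Apply metric to the values inside each subregion, in list order."""
    results = []
    for raw in subregions:
        # Accepts either [latMin, lonMin, latMax, lonMax] or a Bounds instance.
        bounds = Bounds(*raw)
        values = [value for _, _, value in subset_points(points, bounds)]
        # None marks a subregion that contains no data points.
        results.append(metric(values) if values else None)
    return results


def mean(values):
    return sum(values) / len(values)


points = [(10.0, 20.0, 1.0), (12.0, 22.0, 3.0), (40.0, 50.0, 10.0)]
subregions = [[5.0, 15.0, 15.0, 25.0], [35.0, 45.0, 45.0, 55.0]]
print(evaluate_subregions(points, subregions, mean))
```

Under these assumptions the subregion information stays outside the dataset:
the same list of bounds can be applied to any number of datasets.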
>
>
>-- Joyce
>
>
>On Tue, Jul 30, 2013 at 10:49 AM, Mattmann, Chris A (398J) <
>[email protected]> wrote:
>
>> Alex this sounds like a classic QuadTree:
>>
>> http://en.wikipedia.org/wiki/Quadtree
>>
>>
>> One of the fundamental spatial data structures. In Apache SIS we have
>> a Quad Tree implementation, but it's in Java:
>>
>> http://s.apache.org/MDi
>>
>>
>> Maybe we should implement in OCW, or see if we can use SIS's QuadTree
>> through JCC:
>>
>> http://lucene.apache.org/pylucene/jcc/
>>
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: <Goodman>, "Alexander   (398J-Affiliate)" <[email protected]>
>> Reply-To: "[email protected]"
>> <[email protected]>
>> Date: Tuesday, July 30, 2013 1:39 PM
>> To: "[email protected]"
>><[email protected]>
>> Subject: Re: OCW Refactoring and Subregions
>>
>> >Honestly, if we just treat subregions as independent datasets without
>> >special attributes, things can get a little hairy. Right now subregions
>> >by themselves are defined as simple bounding boxes (or in the future
>> >more complex polygons). However, these are not tied to one particular
>> >dataset; they are instead applied to multiple datasets. The proposed
>> >SubregionDataset, on the other hand, must be tied to a parent Dataset
>> >instance, because subsetted data isn't very useful without the original
>> >context. I think this depends on what is meant by "grouping" here, but
>> >if we for example had some function in dataset_processor that spits out
>> >the subsetted datasets and then pass them in to the evaluation later as
>> >generic datasets, how will the user know which dataset the subsetted
>> >data originally belonged to without messing around with the name
>> >attribute?
>> >If we decide that a Dataset subclass for subregions is irrelevant and
>> >keep one generic Dataset class, allow me to propose the following
>> >modification: From a computer science perspective I think it might be
>> >most convenient to think of subregion datasets as a tree data structure,
>> >with the original dataset as the root node. This type of implementation
>> >would make it possible to have several levels of complexity (e.g.
>> >subregions of subregions), but for the sake of argument let's just say
>> >that our tree always has a depth of one since I don't think any
>> >scientist would need more than that. This could be implemented by giving
>> >each Dataset a subregions attribute containing a dictionary of datasets
>> >keyed by subregion name, and of course the values are just new datasets
>> >(ideally making sure that dataset lats, lons, and values are views of
>> >their corresponding attributes in the root dataset and not deep copies).
>> >Then we could define our metrics so that they could work recursively on
>> >each dataset, first running on the entire thing, then repeating on each
>> >child (subregion). A potential disadvantage I might see from this is
>> >then figuring out the best way to structure the result of the metric.
>> >You also need to decide how to then add new subregions to the dataset,
>> >either directly to the dictionary and/or argument in the constructor or
>> >solely through dataset_processor.
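
A rough sketch of the tree idea described above, assuming a heavily
simplified Dataset. The class, the add_subregion method, and run_metric are
hypothetical illustrations, not the OCW API. The nested dictionary returned
by run_metric is only one possible answer to the question of how to
structure the metric result.

```python
class Dataset:
    """Hypothetical, simplified stand-in for an OCW Dataset."""

    def __init__(self, name, values, parent=None):
        self.name = name
        self.values = values
        self.parent = parent      # None for the root dataset
        self.subregions = {}      # subregion name -> child Dataset

    def add_subregion(self, name, start, stop):
        """Attach a child covering values[start:stop] and return it."""
        # Slicing a list copies the data; with NumPy arrays a basic
        # slice would be a view of the parent array instead.
        child = Dataset(name, self.values[start:stop], parent=self)
        self.subregions[name] = child
        return child


def run_metric(dataset, metric):
    """Run metric on the dataset, then recursively on every subregion."""
    return {
        "name": dataset.name,
        "result": metric(dataset.values),
        "subregions": {
            key: run_metric(child, metric)
            for key, child in dataset.subregions.items()
        },
    }


root = Dataset("model", [1.0, 2.0, 3.0, 4.0])
root.add_subregion("R01", 0, 2)
root.add_subregion("R02", 2, 4)
result = run_metric(root, max)
print(result["result"], result["subregions"]["R01"]["result"])
```

Because each child keeps a reference to its parent, the originating dataset
can be recovered without encoding it in the name attribute.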
>> >
>> >Alternatively it would still be fine to make a new subclass of Dataset
>> >and have the primary attributes be multidimensional arrays containing
>> >the subsetted data, but doing this might be less elegant since you would
>> >then need to process it differently in the evaluation.
>> >
>> >Any more thoughts? The tree idea was merely something I just recently
>> >came up with out of the blue, so I am not sure if I covered all the
>> >possible drawbacks.
>> >
>> >Thanks,
>> >Alex
>> >
>> >
>> >On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]>
>>wrote:
>> >
>> >> I think the most important thing that we need to decide on is:
>> >>
>> >> Do we treat subregions of a dataset as a single object? As in
>> >> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions,
>> >> aDataset)
>> >>
>> >> In this case, what would aSubregionedDataset look like? Would it have
>> >> a list of lat lists? One list for each subregion that was taken from
>> >> the dataset?
>> >>
>> >> or is a subregion effectively just a dataset? As in
>> >>
>> >> >>> [firstDataset, secondDataset, ..., nthDataset] =
>> >> DatasetProcessor.subregion(someNSubregions, aDataset)
>> >>
>> >> If we go with the second approach, I don't see the point in making a
>> >> distinction. If a subregion is really just a subset of a dataset then
>> >> there's no purpose in separating the two. The user should be
>> >> responsible for properly grouping datasets (that happen to be
>> >> subregions) into the Evaluation and passing them in the expected
>> >> grouping for plotting. In this case, the Evaluation object doesn't
>> >> treat a subregion differently at all. It's just a Dataset that gets
>> >> run through everything like normal.
>> >>
>> >> If the only purpose for making a special distinction between a
>> >> subregion and a dataset is for grouping convenience then we really need
>> >> to ask ourselves if the user should be responsible for handling the
>> >> grouping so we can simplify the system. Personally, I think the user
>> >> should be responsible for this work. However, that's only because as
>> >> far as I can tell a subregion is just a Dataset with an adjusted
>> >> bounding box. Perhaps I'm oversimplifying.
>> >>
>> >>
>> >> -- Joyce
>> >>
>> >>
>> >> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale <[email protected]>
>> >> wrote:
>> >>
>> >> > I think option 3 is the best given the rationale that has been
>> >> > stated previously.
>> >> >
>> >> > I can add a function to the dataset_processor module that will take
>> >> > in a single Dataset Object and a list of SubRegion Specifications
>> >> > (north, south, east, west, Name), and it could return a tuple of
>> >> > SubRegion objects with a length equal to the number of SubRegion
>> >> > Specs.
>> >> >
>> >> > SubClassing Dataset makes sense because a Dataset and SubRegion
>> >> > share common attributes, but after talking with Mike about the two,
>> >> > can a future science user please give me a clear difference between
>> >> > a Dataset and a SubRegion?
>> >> >
>> >> > I hope a SubRegion assumes specific Metrics to be run, that cannot
>> >> > be run on a Dataset.  I fear if SubRegion and Dataset are too similar
>> >> > it will merely confuse users (and software engineers) about when to
>> >> > use which one.
>> >> >
>> >> > Can anyone articulate the difference between a Dataset and SubRegion
>> >> > for me?
>> >> >
>> >> >
>> >> > Thanks,
>> >> >
>> >> >
>> >> > Cameron
>> >> >
>> >> >
>> >> > On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]>
>> >> wrote:
>> >> >
>> >> > > You covered most everything Alex.
>> >> > >
>> >> > > I'm a fan of inheriting from Dataset to handle Subregions. The
>> >> > > user can still add the "dataset" the same way to an Evaluation.
>> >> > > Then the Evaluation instance can run a separate eval loop to handle
>> >> > > subregions. It makes Evaluation more complicated but using a naming
>> >> > > convention to designate a subregion will just be worse I feel. The
>> >> > > DatasetProcessor could have a function that takes a Dataset and
>> >> > > subregion information and spits out a new SubregionDataset (or some
>> >> > > such meaningful name) instance that the user can add to the
>> >> > > Evaluation.
>> >> > >
>> >> > > What does everyone think would be a good way of handling this?
>> >> > >
>> >> > >
>> >> > > -- Joyce
>> >> > >
>> >> > >
>> >> > > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander
>> >>(398J-Affiliate) <
>> >> > > [email protected]> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > >
>> >> > > > Being able to account for subregions will be a crucial part of
>> >> > > > running an evaluation and making the right plots as part of our
>> >> > > > OCW refactoring. Mike and I had a discussion last Friday on some
>> >> > > > ways to do this and we both thought that the best approach would
>> >> > > > make use of the Dataset class somehow. Some specific ideas we had
>> >> > > > include:
>> >> > > >
>> >> > > > 1) Designate datasets as subregional by convention. Specifically,
>> >> > > > this could be something like making a new dataset instance with
>> >> > > > the same name as the parent dataset but with the subregion name
>> >> > > > appended to the end with a leading underscore (e.g. name_R01,
>> >> > > > name_R02).
>> >> > > >
>> >> > > > 2) Values for a particular subregion could be placed in a list or
>> >> > > > dictionary as an attribute of Dataset.
>> >> > > >
>> >> > > > 3) Make a subclass of Dataset explicitly for subregions.
>> >> > > >
>> >> > > > In general, any approach will add an additional complication to
>> >> > > > some component of the new OCW code in that the evaluation results
>> >> > > > / datasets need to get grouped together by subregion.
>> >> > > >
>> >> > > > My preferred approach is (3) since it adds the least amount of
>> >> > > > complication to the plotting. I particularly don't like (1) since
>> >> > > > enforcing a rule by convention would add restrictions to users on
>> >> > > > valid names for datasets. For example, a dataset name like
>> >> > > > 'TRMM_hourly_precip' would make it difficult to incorporate
>> >> > > > subregions.
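
To illustrate the concern with option (1), here is a small sketch of how a
name-based convention might be parsed. The function split_subregion_name is
hypothetical and only assumes the "parent name, underscore, subregion name"
rule described above.

```python
def split_subregion_name(name):
    """Hypothetical parser for the 'parent_subregion' naming convention."""
    # Split on the last underscore, as the convention would suggest.
    parent, sep, subregion = name.rpartition("_")
    if not sep:
        # No underscore at all: treat the whole name as a root dataset.
        return name, None
    return parent, subregion


print(split_subregion_name("name_R01"))
print(split_subregion_name("TRMM_hourly_precip"))
```

The second call reports 'precip' as a subregion of 'TRMM_hourly', even
though the name belongs to an ordinary dataset, so the convention cannot
tell the two cases apart without restricting which names users may choose.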
>> >> > > >
>> >> > > > Mike, my memory since our last meeting is a bit fuzzy so please
>> >> > > > clarify or correct any of my points if I am wrong here. I would
>> >> > > > like to hear other ideas or opinions as to the best approach for
>> >> > > > the subregion problem.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Alex
>> >> > > >
>> >> > > > --
>> >> > > > Alex Goodman
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>> >
>> >--
>> >Alex Goodman
>>
