Subregions are not tied to datasets; they are tied to the evaluation domain.

If your concern is that the subregion-mean time series are calculated before 
the evaluation step starts, that is because of the option to create a netCDF 
data file of processed (re-gridded + subregion-averaged) data for users' own 
evaluation/application etc. If subregion averaging were done after the 
creation of the netCDF data file, it could result in creating two netCDF data 
files (one for the re-gridded 2-d time series, another for the 
subregion-averaged time series). Because the creation of subregion-mean time 
series is controlled by the entries in the configuration file prepared before 
running rcmet jobs, where in the code the subregion-mean time series are 
calculated should not be an issue, I think.
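The subregion-averaging step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual RCMET code; the function name, array layout, and bounds format are all assumptions for illustration:

```python
# Minimal sketch of subregion averaging over a re-gridded field.
# Assumes data with shape (time, lat, lon) and subregions given as
# (latMin, lonMin, latMax, lonMax) boxes. Hypothetical names throughout.
import numpy as np

def subregion_mean_timeseries(data, lats, lons, bounds):
    """Average a (time, lat, lon) array over each subregion box."""
    series = []
    for lat_min, lon_min, lat_max, lon_max in bounds:
        lat_mask = (lats >= lat_min) & (lats <= lat_max)
        lon_mask = (lons >= lon_min) & (lons <= lon_max)
        box = data[:, lat_mask, :][:, :, lon_mask]
        # One mean value per time step for this subregion
        series.append(box.mean(axis=(1, 2)))
    return np.array(series)  # shape: (n_subregions, time)

lats = np.arange(-10, 11, 1.0)
lons = np.arange(20, 41, 1.0)
data = np.ones((12, lats.size, lons.size))  # 12 monthly steps, constant field
means = subregion_mean_timeseries(data, lats, lons, [(-5, 25, 5, 35)])
print(means.shape)  # (1, 12): one subregion, 12 time steps
```

The resulting (n_subregions, time) array is what one would then write to the processed-data netCDF file alongside the re-gridded 2-d fields.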




-----------------------------------------------------------------------------------------------------
Jinwon Kim
Dept. Atmospheric and Oceanic Sciences and
Joint Institute for Regional Earth System Science and Engineering
University of California, Los Angeles
Los Angeles, CA 90095-1565
________________________________________
From: Ramirez, Paul M (398J) [[email protected]]
Sent: Wednesday, July 31, 2013 9:46 AM
To: [email protected]
Subject: Re: OCW Refactoring and Subregions

Jinwon,

Agreed on all your points. I think the discussion was a point of clarity
on when data preparation is complete and an evaluation is ready to go. The
assumption going in at this point is that the evaluation is across
everything in those datasets. Subregions seemed like a simple case on top
of this: when you get to the evaluation step and want to do something over
subregions, you merely subset the data and run all the metrics on those
subsets.

What I discussed with Mike was why we are trying to tie subregions
to the dataset at all. It seemed as though the subregions don't belong to
any particular dataset but are part of the input to an evaluation.

--Paul
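Paul's point that subregions belong to the evaluation rather than to any dataset could look roughly like the following. The class and attribute names are illustrative assumptions, not the actual OCW API:

```python
# Rough sketch: subregions live on the Evaluation, not on datasets.
# All names here are hypothetical, for illustration only.
class Evaluation:
    def __init__(self, ref_dataset, target_datasets, metrics, subregions=None):
        self.ref_dataset = ref_dataset          # datasets carry no subregion info
        self.target_datasets = target_datasets
        self.metrics = metrics
        self.subregions = subregions or []      # subregions are evaluation input

    def run(self):
        # Without subregions, run each metric over the full overlap;
        # with them, repeat per subregion (actual subsetting elided here).
        regions = self.subregions or [None]
        return {
            "n_metrics": len(self.metrics),
            "n_targets": len(self.target_datasets),
            "n_regions": len(regions),
        }

ev = Evaluation("obs", ["model_a", "model_b"], ["bias"],
                subregions=[(-5, 25, 5, 35)])
print(ev.run())  # {'n_metrics': 1, 'n_targets': 2, 'n_regions': 1}
```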

On 7/30/13 11:00 AM, "Kim, Jinwon" <[email protected]> wrote:

>Please note:
>
>(1) "The assumption so far has been that the datasets in the evaluation
>perfectly overlap both spatially and temporally"
>      >>> There is an exception: if "spatialGrid=user" in the config
>file, the evaluation domain is specified by the user and both model and
>obs data are interpolated onto the user-specified domain.
>(2) "The assumption so far has been that the datasets in the evaluation
>perfectly overlap both spatially and temporally"
>      >>> For temporal overlap, there is a process that checks the time
>steps of individual data files (both model and obs). If time steps are
>mixed (e.g., some daily and some monthly), the code temporally regrids
>the daily data to monthly (interpolation in time is not currently
>allowed).
>
>I am using exclusively the user-specified domain option to avoid a
>problem in the model datasets; although SMHI (and NCAR) interpolated the
>model data onto the same domain, the lon,lat values in individual netCDF
>files can vary because of truncation error. This problem occurred in both
>the CORDEX Africa (model dataset prepared by SMHI) and NARCCAP (prepared
>by NCAR) datasets.
>
>
>
>--------------------------------------------------------------------------
>---------------------------
>Jinwon Kim
>Dept. Atmospheric and Oceanic Sciences and
>Joint Institute for Regional Earth System Science and Engineering
>University of California, Los Angeles
>Los Angeles, CA 90095-1565
>________________________________________
>From: [email protected] [[email protected]] on behalf of Michael Joyce
>[[email protected]]
>Sent: Tuesday, July 30, 2013 10:34 AM
>To: dev
>Subject: Re: OCW Refactoring and Subregions
>
>Had a meeting with Paul. Here's the results.
>
>Currently we haven't specified any region information in the pre-eval/eval
>step. The assumption so far has been that the datasets in the evaluation
>perfectly overlap both spatially and temporally, but there's been no check
>for compliance or a prerequisite function in DatasetProcessor (DSP) that
>does this operation (at least not that I'm aware of). An Evaluation is
>only
>valid if the Datasets overlap perfectly both spatially and temporally. So
>DSP needs a function that takes all the datasets and some bounding
>information and spits out datasets with the correct overlaps (or an error
>if the request isn't possible). To handle subregions, the Evaluation
>object will take an optional subregions object that changes how the
>evaluation is run.
>
>If there aren't any subregions we end up with:
>
>>>> results =
>>>> [ # For a metric
>>>> .... [ # For a target dataset
>>>> .... .... # Results for the evaluation with the reference dataset
>>>> .... ]
>>>> ]
>
>If there are subregions we end up with:
>>>> results =
>>>> [ # For a metric
>>>> .... [ # For a target dataset
>>>> .... .... [ # For a subregion
>>>> .... .... .... # Result for a subregion of the target dataset with the
>reference dataset
>>>> .... .... ]
>>>> .... ]
>>>> ]
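A small concrete illustration of the nesting above, with placeholder values (not real metric output):

```python
# Placeholder values illustrating the proposed nesting:
#   results[metric][target][subregion]
results = [            # one entry per metric (e.g., Bias)
    [                  # one entry per target dataset
        [1.2, 0.8],    # one result per subregion, vs. the reference dataset
    ],
]
# Metric 0, target dataset 0, subregion 1:
bias_for_target0_subregion1 = results[0][0][1]
print(bias_for_target0_subregion1)  # 0.8
```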
>
>This means that we need to change a few things. Here's some pseudo-code
>showing this:
>
>Model-to-obs (no subregions)
>
>>>> model = local.load("some/fake/path")
>>>> obs = rcmed.getDataset(someParamIdOrWhateverWeUseToGrabObservations)
>>>>
>>>> DSP.regrid(model, obs)  # I'm not sure of the exact format we use
>for this, but you get the idea
>>>> model, obs = DSP.subset(evalRegion, [model, obs]) # Here evalRegion
>contains the spatial/temporal bounds for the evaluation
>>>>
>>>> eval = Evaluation(model, obs, Bias()) # Reference dataset, target
>dataset, and metric(s)
>>>> eval.run()
>
>Model-to-obs (with subregions)
>
>Here we add subregions. A subregion is effectively a spatial bound to run
>on the evaluation. We don't necessarily need to create a new class for
>this, but we need to agree on a way of passing this information. I think
>the best way to handle this is that each subregion is a list of [latMin,
>lonMin, latMax, lonMax]. This gives us:
>
>subregionBounds = [[latMin, lonMin, latMax, lonMax], [latMin, lonMin,
>latMax, lonMax], ...]
>
>The evaluation is then run over each subregion. For a subregion to be
>valid
>it must be a subset of the evalRegion that the Evaluation is run over.
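The validity rule above can be sketched as a simple bounds check; the function name and the (latMin, lonMin, latMax, lonMax) tuple layout are assumptions for illustration:

```python
# Hedged sketch of the rule: a subregion is valid only if its box lies
# entirely inside the evalRegion. Names are hypothetical.
def is_valid_subregion(subregion, eval_region):
    """Both arguments are (latMin, lonMin, latMax, lonMax) tuples."""
    s_lat_min, s_lon_min, s_lat_max, s_lon_max = subregion
    e_lat_min, e_lon_min, e_lat_max, e_lon_max = eval_region
    return (e_lat_min <= s_lat_min <= s_lat_max <= e_lat_max and
            e_lon_min <= s_lon_min <= s_lon_max <= e_lon_max)

eval_region = (-15, 10, 15, 45)
print(is_valid_subregion((-5, 25, 5, 35), eval_region))   # inside -> True
print(is_valid_subregion((-20, 25, 5, 35), eval_region))  # extends south -> False
```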
>
>>>> model = local.load("some/fake/path")
>>>> obs = rcmed.getDataset(someParamIdOrWhateverWeUseToGrabObservations)
>>>>
>>>> DSP.regrid(model, obs)  # I'm not sure of the exact format we use
>for this, but you get the idea
>>>> model, obs = DSP.subset(evalRegion, [model, obs]) # Here evalRegion
>contains the spatial/temporal bounds for the evaluation
>>>>
>>>> eval = Evaluation(model, obs, Bias(), subregionBounds) # Reference
>dataset, target dataset, metric(s), subregion bounds
>>>> eval.run()
>
>When doing a multi-dataset evaluation, the calls change just a bit:
>
>>>> DSP.regrid(model, targetDatasets)
>>>> model, targetDatasets = DSP.subset(evalRegion, model + targetDatasets)
>>>>
>>>> eval = Evaluation(model, targetDatasets, someListOfMetrics,
>subregionBounds)
>>>> eval.run()
>
>thoughts?
>
>
>-- Joyce
>
>
>On Tue, Jul 30, 2013 at 8:25 AM, Michael Joyce <[email protected]> wrote:
>
>> I think the most important thing that we need to decide on is:
>>
>> Do we treat subregions of a dataset as a single object? As in
>> >>> aSubregionedDataset = DatasetProcessor.subregion(someSubregions,
>> aDataset)
>>
>> In this case, what would aSubregionedDataset look like? Would it have a
>> list of lat lists? One list for each subregion that was taken from the
>> dataset?
>>
>> or is a subregion effectively just a dataset? As in
>>
>> >>> [firstDataset, secondDataset, ..., nthDataset] =
>> DatasetProcessor.subregion(someNSubregions, aDataset)
>>
>> If we go with the second approach, I don't see the point in making a
>> distinction. If a subregion is really just a subset of a dataset then
>> there's no purpose in separating the two. The user should be
>>responsible
>> for properly grouping datasets (that happen to be subregions) into the
>> Evaluation and passing them in the expected grouping for plotting. In
>>this
>> case, the Evaluation object doesn't treat a subregion differently at
>>all.
>> It's just a Dataset that gets run through everything like normal.
>>
>> If the only purpose for making a special distinction between a subregion
>> and a dataset is for grouping convenience then we really need to ask
>> ourselves if the user should be responsible for handling the grouping
>>so we
>> can simplify the system. Personally, I think the user should be
>>responsible
>> for this work. However, that's only because as far as I can tell a
>> subregion is just a Dataset with an adjusted bounding box. Perhaps I'm
>> oversimplifying.
>>
>>
>> -- Joyce
>>
>>
>> On Tue, Jul 30, 2013 at 8:14 AM, Cameron Goodale
>><[email protected]>wrote:
>>
>>> I think option 3 is the best given the rationale that has been stated
>>> previously.
>>>
>>> I can add a function to the dataset_processor module that will take in
>>>a
>>> single Dataset Object and a list of SubRegion Specifications (north,
>>> south,
>>> east, west, Name), and it could return a tuple of SubRegion objects
>>>with a
>>> length equal to the number of SubRegion Specs.
>>>
>>> SubClassing Dataset makes sense because a Dataset and SubRegion share
>>> common attributes, but after talking with Mike about the two, can a
>>>future
>>> science user please give me a clear difference between a Dataset and a
>>> SubRegion?
>>>
>>> I hope a SubRegion assumes specific Metrics to be run, that cannot be
>>>run
>>> on a Dataset.  I fear if SubRegion and Dataset are too similar it will
>>> merely confuse users (and software engineers) about when to use which
>>>one.
>>>
>>> Can anyone articulate the difference between a Dataset and SubRegion
>>>for
>>> me?
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Cameron
>>>
>>>
>>> On Mon, Jul 29, 2013 at 12:22 PM, Michael Joyce <[email protected]>
>>>wrote:
>>>
>>> > You covered most everything Alex.
>>> >
>>> > I'm a fan of inheriting from Dataset to handle Subregions. The user
>>>can
>>> > still add the "dataset" the same way to an Evaluation. Then the
>>> Evaluation
>>> > instance can run a separate eval loop to handle subregions. It makes
>>> > Evaluation more complicated but using naming convention to designate
>>>a
>>> > subregion will just be worse I feel. The DatasetProcessor could have
>>>a
>>> > function that takes a Dataset and subregion information and spits
>>>out a
>>> new
>>> > SubregionDataset (or some such meaningful name) instance that the
>>>user
>>> can
>>> > add to the Evaluation.
>>> >
>>> > What does everyone think would be a good way of handling this?
>>> >
>>> >
>>> > -- Joyce
>>> >
>>> >
>>> > On Mon, Jul 29, 2013 at 11:36 AM, Goodman, Alexander
>>>(398J-Affiliate) <
>>> > [email protected]> wrote:
>>> >
>>> > > Hi all,
>>> > >
>>> > > Being able to account for subregions will be a crucial part of
>>> running an
>>> > > evaluation and making the right plots as part of our OCW
>>>refactoring.
>>> > Mike
>>> > > and I had a discussion last Friday on some ways to do this and we
>>>both
>>> > > thought that the best approach would make use of the Dataset class
>>> > somehow.
>>> > > Some specific ideas we had include:
>>> > >
>>> > > 1) Designate datasets as subregional by convention. Specifically,
>>>this
>>> > > could be something like making a new dataset instance with the same
>>> name
>>> > as
>>> > > the parent dataset but with the subregion name appended to the end
>>> with a
>>> > > leading underscore (eg name_R01, name_R02).
>>> > >
>>> > > 2) Values for a particular subregion could be placed in a list or
>>> dictionary
>>> > > as an attribute of Dataset.
>>> > >
>>> > > 3) Make a subclass of Dataset explicitly for subregions.
>>> > >
>>> > > In general, any approach will add an additional complication to
>>>some
>>> > > component of the new OCW code in that the evaluation results /
>>> datasets
>>> > > need to get grouped together by subregion.
>>> > >
>>> > > My preferred approach is (3) since it adds the least amount of
>>> > complication
>>> > > to the plotting. I particularly don't like (1) since enforcing a
>>>rule
>>> by
>>> > > convention would add restrictions to users on valid names for
>>> datasets,
>>> > for
>>> > > example a dataset name like 'TRMM_hourly_precip' would make it
>>> difficult
>>> > to
>>> > > incorporate subregions.
>>> > >
>>> > > Mike, my memory since our last meeting is a bit fuzzy so please
>>> clarify
>>> > or
>>> > > correct any of my points if I am wrong here. I would like to hear
>>> other
>>> > > ideas or opinions as to the best approach for the subregion
>>>problem.
>>> > >
>>> > > Thanks,
>>> > > Alex
>>> > >
>>> > > --
>>> > > Alex Goodman
>>> > >
>>> >
>>>
>>
