[I originally sent this message by accident just to the original poster.
The thread appears to be leading to an interesting discussion, and there's
a question of my own embedded somewhere down there about additional
resources, so I figured I'd better send it to everyone else too :-) ]

 

This is a problem that has come up in the past in work I do, related to
estimation of travel demand from input data that is supposed to be "the
same thing" but has been collected over different geographic and
temporal scales, using different methods.  One reference that I've found
useful for dealing with such problems (where variations in scale,
resolution, accuracy, and completeness in data sets need to be
reconciled) is Banerjee, Carlin and Gelfand's book "Hierarchical
Modeling and Analysis for Spatial Data".  I recall that they have a
pretty good discussion, for example, about how to deal with the data
loss that happens at the edges of such an area, where the edges of the
datasets are not aligned (see the chapter on spatial misalignment), and
some of the other problems mentioned in the post are also explicitly
addressed.

 

I'll say some more below about the path I have used in solving a similar
problem, but anyone reading this posting needs to understand that I'm a
"user", not a "producer" of these techniques.  Since this kind of
problem occurs periodically in my work, I'm also interested in hearing
about additional references and strategies from those who study such
problems regularly and think them all the way through, as opposed to
people like me who consume these techniques in a fearless rush on an
as-needed basis because we have to process specific data and get it out
the door in a hurry, without distorting things too badly.

 

The basic solution strategy I've pursued is to recognize that the
individual datasets each contain a grain of "truth" (i.e. reliable
evidence about what one would, under perfect circumstances, expect to
find at a particular location, or over a suitable unit area), but that
the truth is veiled behind various random or systematic distortions
within the dataset.  One needs to sift out the distortions and preserve
the "good" part.  In that light, it's pretty easy to see that
reconciling the data is basically no different than any modeling problem
where you have a series of measurements that serve as predictors along
with some partial measurements of a response variable you're really
interested in.  It boils down to finding a best fit for a model that
describes the relationship between the measurements (partial, fuzzy,
inconsistent, incomplete data) and the quantity of interest (crisp,
regular, complete data).  The fact that the model outcome purports in
this case to be the same quantity that the input variables are
supposedly measuring is, practically speaking, irrelevant.

 

The modeling challenge is to analyze how the individual data sets relate
to each other in delivering the complete picture, use that analysis to
estimate a best fit for the model, test against some data that weren't
part of the original estimation, do a gut-level (or "smell") test on the
overall results, try a revised or different model, and iterate until
everyone who matters is convinced that the results are good enough.  I
can feel the serious researchers on this list squirming in pain with
that last comment, but that's the difference between working for a
department of transportation and working for a research institution.  It's
often amazing how "truth" and "accuracy" can have such different
meanings in these two contexts...
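
For readers who want something concrete, the fit-then-test loop can be
sketched in a few lines of Python.  This is only an illustration under
made-up assumptions: the site counts are invented, and a single scale
factor with no intercept stands in for whatever model one would really
use.

```python
import math

def fit_scale(xs, ys):
    """Least-squares slope for y ~ b * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def rmse(xs, ys, b):
    """Root mean squared error of the fitted line on (xs, ys)."""
    return math.sqrt(sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical (source count, verified count) pairs for eight sites.
sites = [(100, 82), (250, 198), (40, 35), (600, 470),
         (90, 70), (310, 255), (150, 118), (480, 390)]

# Estimate on the first five sites, test on the three left out.
train, holdout = sites[:5], sites[5:]
b = fit_scale([s for s, _ in train], [t for _, t in train])
err_train = rmse([s for s, _ in train], [t for _, t in train], b)
err_holdout = rmse([s for s, _ in holdout], [t for _, t in holdout], b)
print(f"scale={b:.3f} train RMSE={err_train:.1f} holdout RMSE={err_holdout:.1f}")
```

If the holdout error comes out much worse than the training error, that
is a hint the model is fitting quirks of the estimation sites rather
than the underlying relationship, and it is time to try a revised model.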

 

Obviously, it's more work to do such modeling than to pick and choose
among the data sets, or to average them in some simple way, and it also
requires that one really understand the structure, strengths and
limitations of the data sets (which is usually a good thing to do and
which, if my own experience is any guide, people are inclined not to do
thoroughly enough without some external prodding, such as having to
build a working model of the data relationships...).

 

One ends up constructing multi-level models that take each data set as
an imperfect representation at a certain "level" (scale, resolution,
extent), and estimating a "best fit" of that model to produce outcome
data at the desired output scale - the fitted model is then used to
produce a cleaned up data set that has leveled all the available data to
a single consistent scale, resolution, and extent.  A simple source for
the response variable might be places where all the data sets are in
reasonable agreement - but be careful that the agreement is not just due
to those being the most uninteresting locations!
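
A minimal sketch of picking out such agreement locations, assuming each
location already has one estimate per data set expressed in the same
units.  The zone names, the counts and the 10% tolerance are all
hypothetical.

```python
def agreeing_locations(estimates, tolerance=0.10):
    """Return ids of locations whose source estimates all lie within
    `tolerance` (relative) of their mean."""
    selected = []
    for loc, values in estimates.items():
        mean = sum(values) / len(values)
        if mean > 0 and all(abs(v - mean) / mean <= tolerance for v in values):
            selected.append(loc)
    return selected

# Hypothetical estimates from three data sets for four locations.
estimates = {
    "zone_a": [120, 115, 125],
    "zone_b": [40, 90, 10],
    "zone_c": [3, 3, 3],
    "zone_d": [800, 760, 830],
}
calibration = agreeing_locations(estimates)

# Report the mean at each selected location so the range of magnitudes
# can be eyeballed before using them as a response variable.
for loc in calibration:
    print(loc, sum(estimates[loc]) / len(estimates[loc]))
```

In this toy input zone_c passes only because every source reports almost
nothing there, which is exactly the kind of uninteresting agreement to
watch for.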

 

Explaining actual code in this case probably wouldn't be useful (plus,
it feels better to sound like I might know something, rather than to
offer direct evidence that I don't).  But here's a (very simplified)
overview of how I set up one such problem in my own domain:

 

A regular indicator used in my profession (travel demand modeling) to
evaluate the amount of vehicular traffic that is drawn to certain
locations is the number of employees working at the location.  The
process (technically speaking, the attraction end of trip generation) is
one that has a "natural" scale, in that estimates of travel get worse if
you try to be too precise (perhaps because unaccounted background
processes create variance that, at too small a scale, dominates the local
estimate of expected value - I'm sure there's a better technical
explanation of this phenomenon), but if you operate at too large a
scale, the resulting picture, while more "accurate" (i.e. less variable)
is too general and uninformative.   Think of a picture composed of
pixels and what happens when you zoom in too far, or zoom out too far.
We try to account for that natural scale by aggregating predictors such
as number of employees into geographical areas called traffic analysis
zones whose size reflects the "best" (natural) scale (not too much
variation, but also not too general) - that scale in our case is usually
chosen by rule of thumb rather than explicit statistical analysis (so I
don't have much guidance about how to pick the best scale
scientifically).   Of course, traffic analysis zones (particularly in
dense urban areas) usually do not correspond directly to other common
units of geography in which data of interest is reported.
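
One common way to bridge mismatched geography is simple areal
weighting, sketched below.  This is a named stand-in technique, not
necessarily what anyone described here actually used, and it rests on
the strong assumption that employment is spread evenly within each
source unit.  The tract and zone names are hypothetical, and the
overlap shares would normally come from a GIS overlay.

```python
def reallocate(source_counts, overlap_share):
    """Spread each source unit's count across zones in proportion to
    the share of the source unit's area falling in each zone."""
    zone_totals = {}
    for unit, count in source_counts.items():
        for zone, share in overlap_share[unit].items():
            zone_totals[zone] = zone_totals.get(zone, 0.0) + count * share
    return zone_totals

# Hypothetical employment reported by census tract.
tract_employment = {"tract_1": 1000, "tract_2": 400}

# Hypothetical fraction of each tract's area lying in each zone;
# the shares for a tract sum to 1 when the zones cover it completely.
overlap_share = {
    "tract_1": {"taz_10": 0.25, "taz_11": 0.75},
    "tract_2": {"taz_11": 0.5, "taz_12": 0.5},
}

taz_employment = reallocate(tract_employment, overlap_share)
print(taz_employment)
```

Because the shares for each tract sum to 1, the zone totals add up to
the tract totals, so nothing is gained or lost in the transfer.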

 

The sources of employment data themselves are many and inconsistent (and
I'm entirely leaving out the fact that we usually sub-divide employment
into different types such as retail, office, industrial, etc.,
recognizing that retail establishments attract more trips (and
different kinds of trips) per employee than offices do).  So we might
use privately collected data from marketing firms (where number of
employees is a proxy for the power and presence of the company in a
certain market), or from government labor statistics such as records
kept for unemployment insurance, or census records that suggest (via
journey-to-work statistics) how many people might work in a certain area.
Some of these data are presented as point counts at particular
workplaces, some are aggregated to geographic regions such as states or
counties or economic analysis zones or census tracts.  Some of it is
simply always going to be missing (try asking the US military how many
soldiers they expect to have stationed at a certain military base, or
finding out how many employees a certain Mexican restaurant has when
most of the people who work there are not, legally speaking, working
there).  And some of it is systematically distorted:  unemployment
insurance statistics don't count people who are not eligible for
unemployment insurance; some industries such as temporary employment
agencies report employment statistics at the headquarters, but the
workers are actually spread all over the map; some reporting sources
show suspicious spikes in frequency of companies with 50, 100, 500 or
1000 employees; some of the datasets will double-count certain types of
employees (e.g. cafeteria workers in a corporate office building who are
actually employed by - and show up in the statistics for - a food
services sub-contractor).
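
The round-number spikes, at least, are easy to screen for.  Here is a
rough sketch that compares how many records sit exactly on a round
value with how many sit on its immediate neighbors.  The function name,
the window of 2 and the sample counts are all invented for
illustration.

```python
from collections import Counter

def heaping_ratio(counts, value, window=2):
    """Ratio of records at `value` to the average number of records at
    each neighboring value within `window` (excluding `value` itself)."""
    freq = Counter(counts)
    neighbors = [v for v in range(value - window, value + window + 1)
                 if v != value]
    avg = sum(freq[v] for v in neighbors) / len(neighbors)
    return float("inf") if avg == 0 else freq[value] / avg

# Hypothetical reported employee counts for a handful of firms.
reported = [48, 49, 50, 50, 50, 50, 50, 50, 51, 52, 47, 50, 50,
            99, 100, 100, 100, 101]

for round_value in (50, 100):
    print(round_value, heaping_ratio(reported, round_value))
```

A ratio well above 1 suggests that reporters are rounding or guessing
rather than counting; what to do about it (smoothing, down-weighting
the source, or just noting it) is a separate modeling decision.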

 

Based on studying each of the data sources, and using some survey data
that established a "true" answer (in this case, a level of employment
verified by hand that would generate answers with our given trip rates
that matched traffic counts at a cordon around a small set of study
locations over the period of interest) I set up a model that treated
each data source as a particular guess (with some adjustment factors
such as an estimated correction for contract workers based on surveys of
businesses in certain industries) and then estimated the model to merge
weighted components of the various data sets into the geographic unit of
traffic analysis zones.  I won't pretend to have done this in practice
with anywhere near the thoroughness that the problem merits, but the results
seemed reasonable at the time and they could in principle be adjusted
consistently as data sources are updated.
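
To give a flavor of the weighting step only, here is a deliberately
tiny sketch: two data sources, a handful of zones with a surveyed
"true" value, and ordinary least squares (no intercept) to estimate one
weight per source.  It is not the model described above, which also
involved adjustment factors; the numbers are fabricated so that the
right answer (0.5 for each source) is known in advance.

```python
def fit_two_weights(a, b, y):
    """Ordinary least squares for y ~ w1*a + w2*b (no intercept),
    solved from the 2x2 normal equations."""
    saa = sum(x * x for x in a)
    sbb = sum(x * x for x in b)
    sab = sum(p * q for p, q in zip(a, b))
    say = sum(p * q for p, q in zip(a, y))
    sby = sum(p * q for p, q in zip(b, y))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("sources are collinear; weights are not identifiable")
    w1 = (say * sbb - sby * sab) / det
    w2 = (sby * saa - say * sab) / det
    return w1, w2

# Hypothetical zone-level employment from two sources, plus surveyed
# values for the same four zones (constructed as 0.5*a + 0.5*b).
source_a = [100, 200, 50, 400]
source_b = [80, 260, 40, 300]
surveyed = [90, 230, 45, 350]

w1, w2 = fit_two_weights(source_a, source_b, surveyed)

# Apply the fitted weights to a zone that has no survey.
merged = w1 * 150 + w2 * 170
print(w1, w2, merged)
```

With real data the weights would not come out this cleanly, and one
would want to look at the residuals zone by zone before trusting the
merged figures anywhere else.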

 

Good luck with your problem!

-- 
Jeremy Raw, P.E., AICP 
Senior Modeling Systems Analyst | Transportation and Mobility Planning |
Virginia DOT
[EMAIL PROTECTED] | Desk: 804-786-0998 | Fax: 804-225-4785

________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Monica Pisica
Sent: Wednesday, December 03, 2008 4:50 PM
To: [email protected]
Subject: [R-sig-Geo] merging multiscale data



Hi everybody,



I am very much interested in hearing about your experience with
merging data at different scales to obtain a uniform surface at the
finest scale possible, or, failing that, at the optimum scale. Now I know
that "optimum" is in the eye of the beholder, but maybe we can agree on a
definition of optimum scale.



Anyway, suppose I have an area (any area) for which I have several sets
of data, raster and vector, at different scales and different spatial
extents. Some overlap, some not, but the whole area is rather covered by
data. Also suppose that all datasets are in the same datum / projection,
so we are not concerned about it. How do I go about merging all of
that???



One idea might be to transform everything into xyz coordinates and
interpolate the area and get different surfaces for different
resolutions and compare the RMSE and choose the one with the smallest
RMSE. Is that an option??? What if I know the accuracy of each
dataset? How will error propagate then? What if the area is so big
that the computer I work on will not be able to do the interpolation, so
I will be forced to work on tiles? How do I eliminate the border
differences?



If you have any opinion or any good reference to work with I would really
appreciate it.



Thanks,



Monica



