Re: [R-sig-Geo] merging multiscale data

Monica Pisica Thu, 04 Dec 2008 09:43:52 -0800

Hi Jeremy,
 
Thank you very much for your thoughtful reply. The data I need to merge / 
interpolate is "topobathy", in the sense that I am trying to come up with a 
seamless DEM covering both topography and bathymetry in a coastal area of the 
northern Gulf of Mexico. For this task I will have lidar data, DEMs, NOAA 
hydrographic data and USGS bathymetric data; in other words, anything that 
will give me an elevation reading for the area of interest. Your detailed 
explanation gave me some ideas and some hope that it can be done. In your 
studies, did you ever do an error propagation analysis or a sensitivity 
analysis after merging / interpolating multiscale, multi-area data? Is the 
error propagation multiplicative or additive?
 
The scale certainly needs to be "natural" to the studied phenomenon, but I am 
wondering whether topography has the same scale as bathymetry, given that 
nearshore bathymetry is very sensitive to the impact of extreme storms and 
hurricanes. So I doubt I can merge data collected before and after a 
hurricane, which may reduce the density of the data I can actually use. In 
marsh areas it is notoriously difficult to measure elevation, so those data 
will have much lower accuracy than adjacent data. I wonder how this will 
affect the variance and stationarity assumptions needed for at least some of 
the interpolation methods. 
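
One common way to think about readings of unequal accuracy, sketched here 
purely as an illustration and not taken from any reply in this thread, is 
inverse-variance weighting. The sketch assumes the errors are independent and 
unbiased; under that assumption the weights simply add, so the merged variance 
is 1 / sum(weights). The function name and the sample values are hypothetical.

```python
import math


def inverse_variance_merge(readings):
    """Combine independent elevation readings of the same cell.

    `readings` is a list of (elevation_m, standard_error_m) pairs.
    Returns (merged_elevation_m, merged_standard_error_m).
    """
    # Each reading is weighted by the inverse of its error variance.
    weights = [1.0 / (se * se) for _, se in readings]
    total = sum(weights)
    merged = sum(w * z for w, (z, _) in zip(weights, readings)) / total
    # For independent errors, the variance of the weighted mean is 1 / sum(w).
    return merged, math.sqrt(1.0 / total)


# Hypothetical values: a lidar return and a much less accurate marsh survey.
lidar = (0.42, 0.15)   # elevation (m), standard error (m)
marsh = (0.90, 0.60)
merged_z, merged_se = inverse_variance_merge([lidar, marsh])
print(round(merged_z, 3), round(merged_se, 3))
```

With these made-up numbers the merged value stays close to the lidar reading, 
because the marsh reading carries far less weight. If the errors are 
correlated or biased, which seems likely across a storm event, this simple 
rule no longer holds.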
 
I hope others will have some input on this. Meanwhile, if I find any worthy 
references I will post them to the list in case there is enough interest. 
Jeremy, thanks for the reference you mentioned; I will check it out. And one 
more question: what software do you use for your analysis?
 
Thanks,
 
Monica




Subject: RE: [R-sig-Geo] merging multiscale data
Date: Thu, 4 Dec 2008 11:25:19 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]







This is a problem that has come up in the past in work I do, related to 
estimation of travel demand from input data that is supposed to be the same 
thing but has been collected over different geographic and temporal scales, 
using different methods.  One reference that I've found useful for dealing 
with such problems (where variations in scale, resolution, accuracy, and 
completeness in data sets need to be reconciled) is Banerjee, Carlin and 
Gelfand's book "Hierarchical Modeling and Analysis for Spatial Data".  I 
recall that they have a pretty good discussion, for example, about how to deal 
with the data loss that happens at the edges of such an area, where the edges 
of the datasets are not aligned (see the chapter on spatial misalignment), and 
some of the other problems mentioned in the post are also explicitly addressed.
 
I'll say some more below about the path I have used in solving a similar 
problem, but anyone reading this posting needs to understand that I'm a user, 
not a producer of these techniques.  Since this kind of problem occurs 
periodically in my work, I'm also interested in hearing about additional 
references and strategies from those who study such problems regularly and 
think them all the way through, as opposed to people like me who consume these 
techniques in a fearless rush on an as-needed basis because we have to process 
specific data and get it out the door in a hurry, without distorting things too 
badly.
 
The basic solution strategy I've pursued is to recognize that the individual 
datasets each contain a grain of truth (i.e. reliable evidence about what one 
would, under perfect circumstances, expect to find at a particular location, or 
over a suitable unit area), but that the truth is veiled behind various random 
or systematic distortions within the dataset.  One needs to sift out the 
distortions and preserve the good part.  In that light, it's pretty easy to 
see that reconciling the data is basically no different than any modeling 
problem where you have a series of measurements that serve as predictors along 
with some partial measurements of a response variable you're really interested 
in.  It boils down to finding a best fit for a model that describes the 
relationship between the measurements (partial, fuzzy, inconsistent, incomplete 
data) and the quantity of interest (crisp, regular, complete data).  The fact 
that the model outcome purports in this case to be the same quantity that the 
input variables are supposedly measuring is, practically speaking, irrelevant.
 
The modeling challenge is to analyze how the individual data sets relate to 
each other in delivering the complete picture, use that analysis to estimate a 
best fit for the model, test against some data that weren't part of the 
original estimation, do a gut-level (or "smell") test on the overall results, 
try a revised or different model, and iterate until everyone who matters is 
convinced that the results are good enough.  I can feel the serious researchers 
on this list squirming in pain with that last comment, but that's the 
difference between working for a department of transportation, rather than a 
research institution.  It's often amazing how "truth" and "accuracy" can have 
such different meanings in these two contexts.
 
Obviously, it's more work to do such modeling than to pick and choose among the 
data sets, or to average them in some simple way, and it also requires that one 
really understand the structure, strengths and limitations of the data sets 
(which is usually a good thing to do and which, if my own experience is any 
guide, people are inclined not to do thoroughly enough without some external 
prodding, such as having to build a working model of the data relationships).
 
One ends up constructing multi-level models that take each data set as an 
imperfect representation at a certain level (scale, resolution, extent), and 
estimating a best fit of that model to produce outcome data at the desired 
output scale.  The fitted model is then used to produce a cleaned-up data set 
that has leveled all the available data to a single consistent scale, 
resolution, and extent.  A simple source for the response variable might be 
places where all the data sets are in reasonable agreement, but be careful 
that the agreement is not just due to those being the most uninteresting 
locations!
 
Explaining actual code in this case probably wouldn't be useful (plus, it feels 
better to sound like I might know something, rather than to offer direct 
evidence that I don't).  But here's a (very simplified) overview of how I set 
up one such problem in my own domain:
 
A regular indicator used in my profession (travel demand modeling) to evaluate 
the amount of vehicular traffic that is drawn to certain locations is the 
number of employees working at the location.  The process (technically 
speaking, the attraction end of trip generation) is one that has a natural 
scale, in that estimates of travel get worse if you try to be too precise 
(perhaps because unaccounted background processes create variance that, at too 
small a scale, dominates the local estimate of expected value; I'm sure 
there's a better technical explanation of this phenomenon), but if you operate 
at too large a scale, the resulting picture, while more accurate (i.e. less 
variable), is too general and uninformative.  Think of a picture composed of 
pixels and what happens when you zoom in too far, or zoom out too far.  We try 
to account for that natural scale by aggregating predictors such as number of 
employees into geographical areas called "traffic analysis zones" whose size 
reflects the best (natural) scale (not too much variation, but also not too 
general).  That scale in our case is usually chosen by rule of thumb rather 
than explicit statistical analysis (so I don't have much guidance about how to 
pick the best scale scientifically).  Of course, traffic analysis zones 
(particularly in dense urban areas) usually do not correspond directly to other 
common units of geography in which data of interest is reported.
 
The sources of employment data themselves are many and inconsistent (and I'm 
entirely leaving out the fact that we usually sub-divide employment into 
different types such as retail, office, industrial, etc., recognizing that 
retail establishments attract more trips (and different kinds of trips) per 
employee than offices do).  So we might use privately collected data from 
marketing firms (where number of employees is a proxy for the power and 
presence of the company in a certain market), or from government labor 
statistics such as records kept for unemployment insurance, or census records 
that suggest (via journey-to-work statistics) how many people might work in a 
certain area.  Some of these data are presented as point counts at particular 
workplaces, some are aggregated to geographic regions such as states or 
counties or economic analysis zones or census tracts.  Some of it is simply 
always going to be missing (try asking the US military how many soldiers they 
expect to have stationed at a certain military base, or finding out how many 
employees a certain Mexican restaurant has when most of the people who work 
there are not, legally speaking, working there).  And some of it is 
systematically distorted:  unemployment insurance statistics don't count people 
who are not eligible for unemployment insurance; some industries such as 
temporary employment agencies report employment statistics at the headquarters, 
but the workers are actually spread all over the map; some reporting sources 
show suspicious spikes in frequency of companies with 50, 100, 500 or 1000 
employees; some of the datasets will double-count certain types of employees 
(e.g. cafeteria workers in a corporate office building who are actually 
employed by, and show up as statistics for, a food services sub-contractor).
 
Based on studying each of the data sources, and using some survey data that 
established a "true" answer (in this case, a level of employment verified by 
hand that would generate answers with our given trip rates that matched traffic 
counts at a cordon around a small set of study locations over the period of 
interest), I set up a model that treated each data source as a particular guess 
(with some adjustment factors such as an estimated correction for contract 
workers based on surveys of businesses in certain industries) and then 
estimated the model to merge weighted components of the various data sets into 
the geographic unit of traffic analysis zones.  I won't pretend to have done 
this in practice with anything near the thoroughness that the problem merits, 
but the results seemed reasonable at the time and they could in principle be 
adjusted consistently as data sources are updated.
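
To make the general idea concrete, here is a minimal sketch in Python. It is 
not the model described above, only a much simpler stand-in for the same 
pattern: calibrate each source against a few verified zones, then merge the 
corrected sources with weights that reflect how well each one fit. All names 
and numbers are hypothetical, and the sketch assumes each source differs from 
the verified value by a single scale factor plus independent noise.

```python
def fit_source(source, verified):
    """Fit verified ~ scale * source by least squares through the origin.

    Returns (scale, residual_variance) for one data source.
    """
    scale = (sum(s * v for s, v in zip(source, verified))
             / sum(s * s for s in source))
    residuals = [v - scale * s for s, v in zip(source, verified)]
    # One parameter was estimated, so divide by n - 1.
    variance = sum(r * r for r in residuals) / (len(residuals) - 1)
    return scale, variance


def merge_sources(values, fits):
    """Merge one raw value per source into a single estimate.

    Each value is first corrected by its fitted scale, then the corrected
    values are averaged with weights of 1 / residual variance.
    """
    weights = [1.0 / variance for _, variance in fits]
    corrected = [scale * v for v, (scale, _) in zip(values, fits)]
    return sum(w * c for w, c in zip(weights, corrected)) / sum(weights)


# Hypothetical employment counts for five hand-verified zones.
verified = [100, 250, 400, 150, 300]
source_a = [110, 270, 450, 160, 335]   # e.g. a source that tends to overcount
source_b = [70, 210, 300, 140, 220]    # e.g. a noisier source that undercounts

fits = [fit_source(source_a, verified), fit_source(source_b, verified)]

# Apply the calibrated model to a zone that was not part of the estimation.
estimate = merge_sources([220, 150], fits)
print(fits, round(estimate, 1))
```

In this toy data the first source fits the verified zones much more tightly, 
so it dominates the merged estimate. A real application would also need 
held-out verified zones to test against, as described earlier.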
 
Good luck with your problem!

-- Jeremy Raw, P.E., AICP Senior Modeling Systems Analyst | Transportation and 
Mobility Planning | Virginia [EMAIL PROTECTED] | Desk: 804-786-0998 | Fax: 
804-225-4785




From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Monica Pisica
Sent: Wednesday, December 03, 2008 4:50 PM
To: [EMAIL PROTECTED]
Subject: [R-sig-Geo] merging multiscale data
 
Hi everybody,
 
I am very much interested in hearing about your experience with merging data 
at different scales to obtain a uniform surface at the finest scale possible 
or, failing that, at the optimal scale. Now I know that "optimum" is in the 
eye of the beholder, but maybe we can agree on a definition of optimal scale.
 
Anyway, suppose I have an area (any area) for which I have several sets of 
data, raster and vector, at different scales and with different spatial 
extents. Some overlap, some do not, but the whole area is more or less covered 
by data. Also suppose that all datasets are in the same datum / projection, so 
we are not concerned about that. How do I go about merging all of that? 
 
One idea might be to transform everything into xyz coordinates, interpolate 
the area to get surfaces at several resolutions, compare the RMSE, and choose 
the surface with the smallest RMSE. Is that an option? What if I know the 
accuracy of each dataset: how will the error propagate then? And what if the 
area is so big that the computer I work on cannot do the interpolation in one 
pass, so that I am forced to work on tiles: how do I eliminate the border 
differences? 
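
For the tiling question, one widely used idea is to interpolate tiles with 
some overlap and then blend ("feather") the overlapping strip so that each 
tile's influence fades out toward its own edge. The sketch below shows this 
in one dimension only, with hypothetical names and values; it assumes both 
tiles are on the same grid and that the overlap width is known.

```python
def feather_blend(left, right, overlap):
    """Join two 1-D profiles that share `overlap` cells of ground.

    The last `overlap` cells of `left` and the first `overlap` cells of
    `right` cover the same locations. Inside that strip the weight on
    `left` ramps linearly from 1 down to 0.
    """
    if overlap < 2 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be >= 2 and fit inside both tiles")
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w_left = 1.0 - i / (overlap - 1)
        blended.append(w_left * left[start + i] + (1.0 - w_left) * right[i])
    return left[:start] + blended + right[overlap:]


# Hypothetical elevations (m); the tiles disagree by 0.3 m in the overlap.
left_tile = [1.0, 1.1, 1.2, 1.3, 1.4]
right_tile = [1.5, 1.6, 1.7, 1.8, 1.9]
merged = feather_blend(left_tile, right_tile, overlap=3)
print(merged)
```

Blending hides the seam but does not explain it: a consistent offset between 
tiles, as in this toy example, usually points to a bias in one of the inputs 
that is better corrected before interpolation.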
 
If you have any opinions or any good references to work with, I would really 
appreciate them.
 
Thanks,
 
Monica
 




_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
