Hi Jeremy,
Thank you very much for your thoughtful reply. The data I need to merge / interpolate is "topobathy", in the sense that I am trying to come up with a way to build a seamless DEM for topography and bathymetry in a coastal area of the northern Gulf of Mexico. For this task I will have lidar data, DEMs, NOAA hydrographic data and USGS bathy data; in other words, anything that will give me an elevation reading for the area of interest. Your detailed explanation gave me some ideas and some hope that maybe it can be done.
In your studies, did you ever do an error propagation analysis or a sensitivity analysis after merging / interpolating multiscale, multi-area data? Is the error propagation multiplicative or additive?
The scale certainly needs to be "natural" to the studied phenomenon, but I am wondering whether topography has the same scale as bathymetry, in the sense that nearshore bathy is very sensitive to the impact of extreme storms and hurricanes. So I doubt I can merge data from before and after a hurricane, which may decrease the density of the data I can actually use. In the marsh areas it is notoriously difficult to measure elevation, so that data will have much lower accuracy than other adjacent data. I wonder how this will impact the variance and stationarity assumptions needed for at least some of the interpolation methods.
I hope others will have input on this. Meanwhile, if I find any worthy references I will post them to the list in case there is enough interest.
Jeremy, thanks for the reference you mentioned; I will check it out. And one more question: what software do you use to do your analysis?
Thanks,
Monica
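On the question above of whether error propagation is additive or multiplicative, one standard textbook case may help frame it: when overlapping readings are combined as a weighted sum and their errors are independent and unbiased, the variances add, each scaled by its squared weight. The sketch below merges two elevation readings at one location by inverse-variance weighting. The readings, their standard deviations, and the function name `inverse_variance_merge` are hypothetical, and the independence assumption would not hold for surveys separated by a hurricane or for data with systematic bias, as in the marsh areas mentioned above.

```python
import math

def inverse_variance_merge(readings):
    """Merge independent readings given as (value, sigma) pairs.

    Each reading is weighted by 1 / sigma**2. Assumes the errors are
    independent and unbiased; returns (merged value, merged sigma).
    """
    weights = [1.0 / sigma ** 2 for _, sigma in readings]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, readings)) / total
    # Variance of the weighted mean under the independence assumption
    sigma = math.sqrt(1.0 / total)
    return value, sigma

# Hypothetical readings (metres): lidar with sigma 0.15, marsh survey with sigma 0.50
merged_value, merged_sigma = inverse_variance_merge([(1.20, 0.15), (0.90, 0.50)])
print(f"merged elevation {merged_value:.3f} m, sigma {merged_sigma:.3f} m")
```

Under these assumptions the merged value sits close to the more accurate reading and its standard deviation is slightly smaller than either input, so a low-accuracy dataset contributes little but does no harm. Any correlation or bias between sources breaks that guarantee.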
Subject: RE: [R-sig-Geo] merging multiscale data
Date: Thu, 4 Dec 2008 11:25:19 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

This is a problem that has come up in the past in work I do, related to estimation of travel demand from input data that is supposed to be the same thing but has been collected over different geographic and temporal scales, using different methods. One reference that I've found useful for dealing with such problems (where variations in scale, resolution, accuracy, and completeness in data sets need to be reconciled) is Banerjee, Carlin and Gelfand's book Hierarchical Modeling and Analysis for Spatial Data. I recall that they have a pretty good discussion, for example, about how to deal with the data loss that happens at the edges of such an area, where the edges of the datasets are not aligned (see the chapter on spatial misalignment), and some of the other problems mentioned in the post are also explicitly addressed.
I'll say some more below about the path I have used in solving a similar problem, but anyone reading this posting needs to understand that I'm a user, not a producer, of these techniques. Since this kind of problem occurs periodically in my work, I'm also interested in hearing about additional references and strategies from those who study such problems regularly and think them all the way through, as opposed to people like me who consume these techniques in a fearless rush on an as-needed basis because we have to process specific data and get it out the door in a hurry, without distorting things too badly.
The basic solution strategy I've pursued is to recognize that the individual datasets each contain a grain of truth (i.e. reliable evidence about what one would, under perfect circumstances, expect to find at a particular location, or over a suitable unit area), but that the truth is veiled behind various random or systematic distortions within the dataset. One needs to sift out the distortions and preserve the good part.
In that light, it's pretty easy to see that reconciling the data is basically no different from any modeling problem where you have a series of measurements that serve as predictors, along with some partial measurements of a response variable you're really interested in. It boils down to finding a best fit for a model that describes the relationship between the measurements (partial, fuzzy, inconsistent, incomplete data) and the quantity of interest (crisp, regular, complete data). The fact that the model outcome purports in this case to be the same quantity that the input variables are supposedly measuring is, practically speaking, irrelevant. The modeling challenge is to analyze how the individual data sets relate to each other in delivering the complete picture, use that analysis to estimate a best fit for the model, test against some data that weren't part of the original estimation, do a gut-level (or "smell") test on the overall results, try a revised or different model, and iterate until everyone who matters is convinced that the results are good enough. I can feel the serious researchers on this list squirming in pain with that last comment, but that's the difference between working for a department of transportation and working for a research institution. It's often amazing how "truth" and "accuracy" can have such different meanings in these two contexts.
Obviously, it's more work to do such modeling than to pick and choose among the data sets, or to average them in some simple way, and it also requires that one really understand the structure, strengths and limitations of the data sets (which is usually a good thing to do and which, if my own experience is any guide, people are inclined not to do thoroughly enough without some external prodding, such as having to build a working model of the data relationships).
One ends up constructing multi-level models that take each data set as an imperfect representation at a certain level (scale, resolution, extent), and estimating a best fit of that model to produce outcome data at the desired output scale. The fitted model is then used to produce a cleaned-up data set that has leveled all the available data to a single consistent scale, resolution, and extent. A simple source for the response variable might be places where all the data sets are in reasonable agreement, but be careful that the agreement is not just due to those being the most uninteresting locations!
Explaining actual code in this case probably wouldn't be useful (plus, it feels better to sound like I might know something, rather than to offer direct evidence that I don't). But here's a (very simplified) overview of how I set up one such problem in my own domain. A regular indicator used in my profession (travel demand modeling) to evaluate the amount of vehicular traffic that is drawn to certain locations is the number of employees working at the location. The process (technically speaking, the attraction end of trip generation) is one that has a natural scale, in that estimates of travel get worse if you try to be too precise (perhaps because unaccounted background processes create variance that, at too small a scale, dominates the local estimate of expected value; I'm sure there's a better technical explanation of this phenomenon), but if you operate at too large a scale, the resulting picture, while more accurate (i.e. less variable), is too general and uninformative. Think of a picture composed of pixels and what happens when you zoom in too far, or zoom out too far.
We try to account for that natural scale by aggregating predictors such as number of employees into geographical areas called traffic analysis zones, whose size reflects the best (natural) scale (not too much variation, but also not too general). That scale in our case is usually chosen by rule of thumb rather than explicit statistical analysis (so I don't have much guidance about how to pick the best scale scientifically). Of course, traffic analysis zones (particularly in dense urban areas) usually do not correspond directly to other common units of geography in which data of interest are reported.
The sources of employment data themselves are many and inconsistent (and I'm entirely leaving out the fact that we usually sub-divide employment into different types such as retail, office, industrial, etc., recognizing that retail establishments attract more trips, and different kinds of trips, per employee than offices do). So we might use privately collected data from marketing firms (where number of employees is a proxy for the power and presence of the company in a certain market), or from government labor statistics such as records kept for unemployment insurance, or census records that suggest (via journey-to-work statistics) how many people might work in a certain area. Some of these data are presented as point counts at particular workplaces; some are aggregated to geographic regions such as states or counties or economic analysis zones or census tracts. Some of it is simply always going to be missing (try asking the US military how many soldiers they expect to have stationed at a certain military base, or finding out how many employees a certain Mexican restaurant has when most of the people who work there are not, legally speaking, working there).
And some of it is systematically distorted: unemployment insurance statistics don't count people who are not eligible for unemployment insurance; some industries, such as temporary employment agencies, report employment statistics at the headquarters, but the workers are actually spread all over the map; some reporting sources show suspicious spikes in frequency of companies with 50, 100, 500 or 1000 employees; some of the datasets will double-count certain types of employees (e.g. cafeteria workers in a corporate office building who are actually employed by, and show up as statistics for, a food services sub-contractor).
Based on studying each of the data sources, and using some survey data that established a "true" answer (in this case, a level of employment verified by hand that would generate answers with our given trip rates that matched traffic counts at a cordon around a small set of study locations over the period of interest), I set up a model that treated each data source as a particular guess (with some adjustment factors, such as an estimated correction for contract workers based on surveys of businesses in certain industries) and then estimated the model to merge weighted components of the various data sets into the geographic unit of traffic analysis zones. I won't pretend to have done this in practice with anything near the thoroughness that the problem merits, but the results seemed reasonable at the time, and they could in principle be adjusted consistently as data sources are updated.
Good luck with your problem!
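The general idea described above (treat each source as an imperfect guess, estimate weights against locations where a verified answer exists, then test on data left out of the estimation) can be sketched as follows. This is not the model used in the post, only a minimal illustration with ordinary least squares on synthetic data. The three sources, their distortions, and every variable name are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: a verified value per zone and three distorted sources
n_zones = 200
truth = rng.uniform(50, 500, n_zones)
sources = np.column_stack([
    truth * 0.8 + rng.normal(0, 10, n_zones),  # systematic undercount, precise
    truth * 1.1 + rng.normal(0, 40, n_zones),  # overcount, noisier
    truth + rng.normal(0, 80, n_zones),        # unbiased, very noisy
])

# Estimate on 150 zones, keep 50 zones out of the fit for testing
train = np.arange(n_zones) < 150
test = ~train

# Least-squares fit of an intercept plus one weight per source
X_train = np.column_stack([np.ones(train.sum()), sources[train]])
coef, *_ = np.linalg.lstsq(X_train, truth[train], rcond=None)

X_test = np.column_stack([np.ones(test.sum()), sources[test]])
pred = X_test @ coef

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_model = rmse(pred, truth[test])
rmse_simple_mean = rmse(sources[test].mean(axis=1), truth[test])
print(f"fitted weights: {rmse_model:.1f}, simple average: {rmse_simple_mean:.1f}")
```

On this synthetic data the fitted weights should beat a plain average of the sources on the held-out zones, because the fit can correct the systematic undercount and lean on the most precise source. With real data the gain depends on how representative the verified locations are.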
-- Jeremy Raw, P.E., AICP
Senior Modeling Systems Analyst | Transportation and Mobility Planning | Virginia [EMAIL PROTECTED] | Desk: 804-786-0998 | Fax: 804-225-4785

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Monica Pisica
Sent: Wednesday, December 03, 2008 4:50 PM
To: [EMAIL PROTECTED]
Subject: [R-sig-Geo] merging multiscale data

Hi everybody,
I am very much interested in hearing about your experience with merging data at different scales to obtain a uniform surface at the finest scale possible, or, failing that, at the most suitable ("optimum") scale. Now I know that "optimum" is in the eye of the beholder, but maybe we can agree on a definition of optimum scale.
Anyway, suppose I have an area (any area) for which I have several sets of data, raster and vector, at different scales and with different spatial extents. Some overlap, some do not, but the whole area is more or less covered by data. Also suppose that all datasets are in the same datum / projection, so we are not concerned about that. How do I go about merging all of that?
One idea might be to transform everything into xyz coordinates, interpolate over the area to get different surfaces at different resolutions, compare the RMSE, and choose the surface with the smallest RMSE. Is that an option? What if I know the accuracy of each dataset: how will the error propagate then? What if the area is so big that the computer I work on cannot do the interpolation, so that I am forced to work on tiles: how do I eliminate the differences at the tile borders?
If you have any opinion or any good reference to work with, I will really appreciate it.
Thanks,
Monica
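The idea in the original question of comparing RMSE across resolutions can be made concrete by scoring each candidate resolution on points held out of the surface construction, so that a very fine grid is not rewarded for simply reproducing its own inputs. The sketch below is only an illustration under strong assumptions: it uses synthetic xyz points, replaces true interpolation with plain block averaging per grid cell, and all names and cell sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scattered elevation points over a 1000 m x 1000 m area
n = 2000
extent = 1000.0
xy = rng.uniform(0, extent, size=(n, 2))
z = 5 * np.sin(xy[:, 0] / 150) + 0.004 * xy[:, 1] + rng.normal(0, 0.3, n)

# Hold out roughly a quarter of the points for scoring
held_out = rng.random(n) < 0.25
train_xy, train_z = xy[~held_out], z[~held_out]
test_xy, test_z = xy[held_out], z[held_out]

def holdout_rmse(cell_size):
    """RMSE on held-out points of a block-averaged grid with this cell size."""
    n_cells = int(np.ceil(extent / cell_size))

    def cell_index(points):
        # Row/column of each point, flattened to a single cell number
        ij = np.minimum((points // cell_size).astype(int), n_cells - 1)
        return ij[:, 0] * n_cells + ij[:, 1]

    idx = cell_index(train_xy)
    sums = np.bincount(idx, weights=train_z, minlength=n_cells ** 2)
    counts = np.bincount(idx, minlength=n_cells ** 2)
    # Cells without training points fall back to the global training mean
    means = np.where(counts > 0, sums / np.maximum(counts, 1), train_z.mean())
    pred = means[cell_index(test_xy)]
    return float(np.sqrt(np.mean((pred - test_z) ** 2)))

cell_sizes = [10, 25, 50, 100, 250, 500]
scores = {size: holdout_rmse(size) for size in cell_sizes}
best_size = min(scores, key=scores.get)
print(scores, "best cell size:", best_size)
```

In this setup the finest grid scores poorly because most of its cells contain no data, and the coarsest grid scores poorly because it averages away real relief, which echoes the "natural scale" argument in the reply above. The smallest hold-out RMSE is only one possible criterion for choosing a scale, not a definition of the optimum.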
_______________________________________________ R-sig-Geo mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/r-sig-geo
