Hi Chris,

I agree that we should probably come up with a more comprehensive solution for this with respect to the metadata object and the resulting XHTML. That would make the geospatial information feel more like a first-class citizen in the metadata hierarchy.
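As a rough illustration only, here is a minimal sketch of geospatial fields sitting in a key/multi-value structure next to ordinary metadata. The class, the `geo:*` key names, and the naive string matching are all made up for this example; none of it comes from Tika or from the TIKA-605 patch. It keeps the full WKT string and also pulls out the datum and projection names, so both forms are available.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika or GDAL.
final class GeoMetadataSketch {
    // Assumed key names, chosen for illustration only.
    static final String KEY_WKT = "geo:wkt";
    static final String KEY_DATUM = "geo:datum";
    static final String KEY_PROJECTION = "geo:projection";

    // Returns the quoted name that follows KEYWORD[" in the WKT, or null.
    // Naive on purpose: no whitespace tolerance, no real WKT parsing.
    static String nameOf(String wkt, String keyword) {
        String marker = keyword + "[\"";
        int start = wkt.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = wkt.indexOf('"', start);
        return end < 0 ? null : wkt.substring(start, end);
    }

    // Appends a value under a key, skipping values that were not found.
    static void add(Map<String, List<String>> met, String key, String value) {
        if (value != null) {
            met.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    // Stores the WKT whole and also broken into datum and projection.
    static Map<String, List<String>> describe(String wkt) {
        Map<String, List<String>> met = new LinkedHashMap<>();
        add(met, KEY_WKT, wkt);
        add(met, KEY_DATUM, nameOf(wkt, "DATUM"));
        add(met, KEY_PROJECTION, nameOf(wkt, "PROJECTION"));
        return met;
    }

    public static void main(String[] args) {
        // Abbreviated WKT string, shortened for the example.
        String wkt = "PROJCS[\"WGS 84 / UTM zone 11N\",GEOGCS[\"WGS 84\","
                + "DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]]],"
                + "PROJECTION[\"Transverse_Mercator\"]]";
        describe(wkt).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A purely geographic system has no PROJECTION element, so in this sketch the projection key is simply left out rather than stored empty.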
We will probably need to support more coordinate systems than just WGS 84, as there are a number of systems that have no transformation to WGS 84. The encoding of the WKT is also pretty important. Would you rather break it down into its component parts, probably datum and projection for starters, or leave it whole? Obviously, the more metadata we have, the more powerful Tika becomes, but there is a point where the extra data stops being useful.

On another note, I took a look at the code for your 605 patch, and I have a suggestion. Reading the notes on the check-ins for the patch, I noticed that no one had suggested using the in-memory Dataset as the default type. There is no reason why the stream used to open the Tika parser could not be used to fill a buffer with the file data, which could then be used to create a dataset.

As it is, I'm trying to get GDAL to cooperate with me on my Mac. Being a newcomer to the Mac seems to be a drawback when trying to be productive. It just takes a little more of a fight to get the bits to do what I really want. In any case, once I get GDAL whipped into shape, I'll see if I can get geospatial data recognized in a test file, and then we will be off and running.

Thanks,
Joe

On Feb 26, 2012, at 1:10 PM, Mattmann, Chris A (388J) wrote:

> Hi Joe,
>
> Awesome! Thanks for picking this up and getting interested in this work.
> Right now, the only use case we've had so far is representing lats and
> lons (WGS84). It would be great to extract more information and come up
> with a policy for representing more WKTs and so forth. We should probably
> start by coming up with a scheme for encoding the extracted information in
> the Tika metadata object and in its output XHTML. Do you have any ideas
> about how to do that?
> Right now in the existing patch on TIKA-605, I simply intended to use the
> met object and its key-multi-value structure to represent the extracted
> information, but to take advantage of streaming and of content handlers,
> we ought to encode this information in the output XHTML.
>
> Thoughts?
>
> Cheers,
> Chris
>
> On Feb 26, 2012, at 9:39 AM, Joe White wrote:
>
>> Hi,
>> I'm looking into implementing a bridge/link between Tika and GDAL so that
>> geospatial information can be saved from georeferenced images and vector
>> types. One thing that I have noticed while going through the code is that
>> the code only defines geographic coordinate types, using latitudes and
>> longitudes. Is this by design? If GDAL is wrapped into Tika, and a
>> projected image is imported, are the geospatial extents meant to be held in
>> the metadata as geographic points, possibly as WGS 84?
>>
>> Thanks
>>
>> Joe White
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
