You’re on the right track, Tom - I’m just trying to save you from having to use the XMLValidationLayer; in reality you want something like it that will accept * patterns.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


-----Original Message-----
From: Tom Barber <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, April 4, 2015 at 2:37 AM
To: "[email protected]" <[email protected]>
Subject: Re: Tika Based Metadata Extraction

>It seems to me (without looking at the source for Chris' examples) that
>either it's more complex than I imagined or I'm just bad at explaining
>stuff.
>
>My understanding of using the crawler: the TikaCmdLineMetExtractor
>creates a met file on the fly?
>
>Within a met file is the metadata associated with a product you are
>about to ingest.
>
>Those met files map to a product mapping file in the filemgr policy
>area. So Tika extracts lots of metadata already, so does this get put in
>the .met file where I can map it directly to a product-map-element file:
>
><type id="urn:oodt:ImageFile">
>  <element id="urn:oodt:ProductReceivedTime"/>
>  <element id="urn:oodt:ProductName"/>
>  <element id="urn:oodt:ProductId"/>
>  <element id="urn:oodt:ProductType"/>
>  <element id="urn:oodt:ProductStructure"/>
>  <element id="urn:oodt:Filename"/>
>  <element id="urn:oodt:FileLocation"/>
>  <element id="urn:oodt:MimeType"/>
>  <element id="urn:test:DataVersion"/>
>  <element id="urn:tika:SomejpegData"/>
></type>
>
>I would have thought that would have made ingestion of extended metadata
>without having to write code far easier, but I couldn't find an example.
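[For readers of the archive: the .met file Tom describes is a CAS metadata file. A minimal sketch of what the TikaCmdLineMetExtractor would write out is below; the Tika-derived key names are illustrative, as the exact keys depend on which parser handles the file. Each `key` that should be ingested generally needs a matching `element` declared for the product type in the filemgr policy, which is exactly the mapping Tom is asking about.]

```xml
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <keyval>
    <key>ProductType</key>
    <val>GenericFile</val>
  </keyval>
  <keyval>
    <key>Filename</key>
    <val>photo.jpg</val>
  </keyval>
  <!-- Tika-derived keys land here too; names vary by parser -->
  <keyval>
    <key>Content-Type</key>
    <val>image/jpeg</val>
  </keyval>
</cas:metadata>
```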
>
>Clearly by now I could have debugged the source code :) so I guess I'll
>do that this evening and see who is correct (or how bad I am at
>explaining stuff)
>
>
>Tom
>
>
>On Sat, Apr 04, 2015 at 05:16:53AM +0000, Mattmann, Chris A (3980) wrote:
>>The suggestion I have would be to whip up a quick implementation
>>of a LenientValidationLayer that takes in a Catalog implementation.
>>If it's the DataSource/MappedDataSource/ScienceData catalog, you:
>>
>>1. iterate over all product types and then get 1 hit from each,
>>getting their metadata, and using that to "infer" what the elements
>>are. I would do this statically 1x for each product type and update
>>it based on a cache timeout (every 5 mins, or so)
>>
>>If it's the LuceneCatalog / SolrCatalog, yay, it's Lucene, and you
>>should be able to ask it for the TermVocabulary and/or all the fields
>>present in the index. Single call. Easy.
>>
>>Another way to do it would be to build a Lucene/Solr, and a
>>DataSource/Mapped/ScienceData Lenient Val Layer that simply takes a ref
>>to the Catalog and/or Database, ignores having to go through the
>>Catalog interface, and then simply gets the info you need (and lets all
>>fields through and returns them the same).
>>
>>HTH,
>>Chris
>>
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
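[For readers of the archive: the infer-and-cache scheme Chris sketches in step 1 can be written, independent of OODT's actual ValidationLayer interface, as a small standalone class. The class name, the fetcher callback, and the 5-minute TTL are all illustrative assumptions, not code from the project; in a real implementation the fetcher would pull one hit per product type from the Catalog and read its metadata keys.]

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/**
 * Standalone sketch of the "lenient validation" idea from the thread:
 * instead of a hand-maintained elements.xml, infer the valid element
 * names for a product type from metadata already in the catalog, and
 * refresh that inferred view after a cache timeout.
 */
public class LenientElementCache {
    // Callback standing in for "get 1 hit per product type and read its met keys".
    private final Function<String, Set<String>> elementFetcher;
    private final long ttlMillis;
    private final Map<String, Set<String>> cache = new HashMap<>();
    private final Map<String, Long> fetchedAt = new HashMap<>();

    public LenientElementCache(Function<String, Set<String>> elementFetcher, long ttlMillis) {
        this.elementFetcher = elementFetcher;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the inferred element names for a product type, refreshing stale entries. */
    public synchronized Set<String> getElements(String productType) {
        long now = System.currentTimeMillis();
        Long last = fetchedAt.get(productType);
        if (last == null || now - last > ttlMillis) {
            cache.put(productType, new HashSet<>(elementFetcher.apply(productType)));
            fetchedAt.put(productType, now);
        }
        return cache.get(productType);
    }

    /** Truly lenient check: let every field through, as the thread suggests. */
    public boolean isValid(String productType, String elementName) {
        return true;
    }
}
```

The point of the TTL is that the inference runs once per product type rather than on every ingest, then picks up newly seen fields after the timeout expires.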
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: [email protected]
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>-----Original Message-----
>>From: Tom Barber <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Friday, April 3, 2015 at 10:31 AM
>>To: "[email protected]" <[email protected]>
>>Subject: Re: Tika Based Metadata Extraction
>>
>>>Sorry, I meant the product element mapping file in my filemgr policy;
>>>by default you have the genericfile policy. So if I run tika-app over
>>>a jpeg file, for example, I can see all the exif data etc. in fields.
>>>Can I just map that to a product type without writing code?
>>>
>>>Tom
>>>On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <[email protected]>
>>>wrote:
>>>
>>>> Hi Tom,
>>>>
>>>> On Friday, April 3, 2015, Tom Barber <[email protected]> wrote:
>>>>
>>>> > Hello Chaps and Chapesses,
>>>> >
>>>> > Somehow I've come this far and not done it, but I was playing
>>>> > around with the crawler for my ApacheCon demo and came across the
>>>> > TikaCmdLineMetExtractor that Rishi, I believe, wrote a while ago.
>>>> > So I've put some stuff in a folder and can crawl and ingest it
>>>> > using the GenericFile element map. Now, in the past, to map
>>>> > metadata I've written some class to pump the data around and add
>>>> > to that file,
>>>>
>>>>
>>>> To what file?
>>>>
>>>>
>>>> > but I was wondering, as I know what fields are coming out of Tika,
>>>> > if I can just put them into the XML mapping file somehow so I can
>>>> > bypass having to write Java code?
>>>>
>>>>
>>>> Well, Tika will make a best effort to pull out as much metadata as
>>>> possible. Chris explains a good bit about this here:
>>>>
>>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>>>
>>>> I think that if custom extractions are required... you could most
>>>> likely extend the extractor interface and implement it, but... this
>>>> is Java code, which I assume you are trying to work around?
>>>>
>>>>
>>>> > This may be very obvious, in which case I apologise, but I can't
>>>> > find owt on the wiki so I figured I'd ask the gurus.
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>
