The suggestion I have would be to whip up a quick implementation of a LenientValidationLayer that takes in a Catalog implementation. Then:

1. If it's the DataSource/MappedDataSource/ScienceData catalog, iterate over all product types, get one hit from each, grab its metadata, and use that to "infer" what the elements are. I would do this statically once per product type and refresh it on a cache timeout (every 5 minutes, or so).
2. If it's the LuceneCatalog/SolrCatalog, yay, it's Lucene, and you should be able to ask it for the TermVocabulary and/or all the fields present in the index. Single call. Easy.

Another way to do it would be to build a Lucene/Solr and a DataSource/Mapped/ScienceData lenient validation layer that simply takes a ref to the Catalog and/or database, skips having to go through the Catalog interface, and gets the info you need directly (letting all fields through and returning them the same).

HTH,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Tom Barber <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015 at 10:31 AM
To: "[email protected]" <[email protected]>
Subject: Re: Tika Based Metadata Extraction

>Sorry, I meant the product element mapping file in my filemgr policy; by
>default you have the GenericFile policy. So if I run tika-app over a JPEG
>file, for example, I can see all the EXIF data etc. in fields. Can I just
>map that to a product type without writing code?
>
>Tom
>
>On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <[email protected]> wrote:
>
>> Hi Tom,
>>
>> On Friday, April 3, 2015, Tom Barber <[email protected]> wrote:
>>
>> > Hello Chaps and Chapesses,
>> >
>> > Somehow I've come this far and not done it, but I was playing around
>> > with the crawler for my ApacheCon demo and came across the
>> > TikaCmdLineMetExtractor that Rishi, I believe, wrote a while ago.
>> > So I've put some stuff in a folder and can crawl and ingest it using
>> > the GenericFile element map. Now, in the past, to map metadata I've
>> > written some class to pump the data around and add to that file,
>>
>> To what file?
>>
>> > but I was wondering, since I know what fields are coming out of Tika,
>> > if I could just put them into the XML mapping file somehow, so I can
>> > bypass having to write Java code?
>>
>> Well, Tika will make a best effort to pull out as much metadata as
>> possible. Chris explains a good bit about this here:
>>
>> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>
>> I think that if custom extractions are required... you could most likely
>> extend the extractor interface and implement it, but... this is Java
>> code, which I assume you are trying to work around?
>>
>> > This may be very obvious, in which case I apologise, but I can't find
>> > owt on the wiki, so I figured I'd ask the gurus.
>>
>> --
>> *Lewis*
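[Editor's note] Chris's first option at the top of the thread (infer the valid elements by sampling one product's metadata per product type, cache the result with a ~5-minute timeout, and let every field validate) could be sketched roughly as below. The `Catalog` interface and method names here are simplified stand-ins for illustration, not the real `org.apache.oodt.cas.filemgr` APIs:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a "lenient" validation layer: instead of reading a static
// element policy, it infers the elements for a product type from one
// sample product's metadata keys, caches them, and refreshes on a TTL.
public class LenientValidationLayer {

    /** Simplified stand-in for a catalog that can return sample metadata. */
    public interface Catalog {
        /** Metadata of one representative product of the given type. */
        Map<String, String> getSampleMetadata(String productType);
    }

    private static final long CACHE_TTL_MS = 5 * 60 * 1000L; // ~5 minutes

    private final Catalog catalog;
    private final ConcurrentHashMap<String, CacheEntry> cache =
        new ConcurrentHashMap<>();

    private static final class CacheEntry {
        final Set<String> elements;
        final long fetchedAtMs;
        CacheEntry(Set<String> elements, long fetchedAtMs) {
            this.elements = elements;
            this.fetchedAtMs = fetchedAtMs;
        }
    }

    public LenientValidationLayer(Catalog catalog) {
        this.catalog = catalog;
    }

    /**
     * Infers the element names for a product type by sampling one
     * product's metadata keys, refreshing the cache after the TTL.
     */
    public Set<String> getElementNames(String productType) {
        long now = System.currentTimeMillis();
        CacheEntry entry = cache.get(productType);
        if (entry == null || now - entry.fetchedAtMs > CACHE_TTL_MS) {
            Set<String> inferred =
                new TreeSet<>(catalog.getSampleMetadata(productType).keySet());
            entry = new CacheEntry(inferred, now);
            cache.put(productType, entry);
        }
        return entry.elements;
    }

    /** Lenient by design: every field is let through as valid. */
    public boolean isValidElement(String productType, String elementName) {
        return true;
    }
}
```

For the Lucene/Solr case, the same class could instead fill the cache from the index's field names rather than sampling a product, since the index already knows every field it contains.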
