The suggestion I have would be to whip up a quick implementation
of a LenientValidationLayer that takes in a Catalog implementation.
If it’s the DataSource/MappedDataSource/ScienceData catalog, you:

1. iterate over all product types, get one hit from each, pull its
metadata, and use that to “infer” what the elements are. I would do
this once per product type and refresh it based on a cache timeout
(every 5 mins, or so)

If it’s the LuceneCatalog/SolrCatalog, yay, it’s Lucene, and you should
be able to ask it for the TermVocabulary and/or all the fields present
in the index. Single call. Easy.
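A rough sketch of that infer-once-and-cache idea for the database-backed catalogs. Note the interfaces below are hypothetical stand-ins for illustration, not the real org.apache.oodt.cas.filemgr Catalog API:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal catalog view; the real OODT Catalog interface differs.
interface Catalog {
    List<String> getProductTypes();
    Map<String, String> getFirstProductMetadata(String productType);
}

/** Caches element names inferred from one sample product per type. */
class InferredElementCache {
    private final Catalog catalog;
    private final long ttlMillis;
    private final Map<String, Set<String>> elementsByType = new ConcurrentHashMap<>();
    private volatile long lastRefresh = 0L;

    InferredElementCache(Catalog catalog, long ttlMillis) {
        this.catalog = catalog;
        this.ttlMillis = ttlMillis;
    }

    /** Returns inferred element names for a product type, re-inferring after the TTL. */
    Set<String> getElements(String productType) {
        long now = System.currentTimeMillis();
        if (now - lastRefresh > ttlMillis) {
            refresh();
            lastRefresh = now;
        }
        return elementsByType.getOrDefault(productType, Collections.emptySet());
    }

    private void refresh() {
        // One sample hit per product type; its metadata keys become the elements.
        for (String type : catalog.getProductTypes()) {
            Map<String, String> met = catalog.getFirstProductMetadata(type);
            elementsByType.put(type, new HashSet<>(met.keySet()));
        }
    }
}
```

You would wire the TTL to whatever cache timeout you pick (5 minutes here would be `5 * 60 * 1000L`).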

Another way to do it would be to build a Lucene/Solr and a
DataSource/Mapped/ScienceData lenient validation layer that simply
takes a ref to the Catalog and/or database, skips going through the
Catalog interface, and just gets the info you need (letting all
fields through and returning them unchanged).
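The lenient layer itself is tiny: it just never rejects anything. A minimal sketch, again with a hypothetical interface (the real org.apache.oodt.cas.filemgr.validation.ValidationLayer API is different):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validation-layer interface for illustration only.
interface ValidationLayer {
    boolean accepts(String productType, String elementName);
    Map<String, String> filter(String productType, Map<String, String> metadata);
}

/** Lets every field through unchanged, regardless of policy. */
class LenientValidationLayer implements ValidationLayer {
    public boolean accepts(String productType, String elementName) {
        return true; // lenient: never reject an element
    }

    public Map<String, String> filter(String productType, Map<String, String> metadata) {
        // Pass all fields through, returning them the same.
        return new LinkedHashMap<>(metadata);
    }
}
```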

HTH,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Tom Barber <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015 at 10:31 AM
To: "[email protected]" <[email protected]>
Subject: Re: Tika Based Metadata Extraction

>Sorry the product element mapping file in my filemgr policy, by default
>you
>have the genericfile policy. So if i run tika app over a jpeg file for
>example i can see all the exif data etc in fields. Can i just map that to
>a
>product type without writing code?
>
>Tom
>On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <[email protected]>
>wrote:
>
>> Hi Tom,
>>
>> On Friday, April 3, 2015, Tom Barber <[email protected]> wrote:
>>
>> > Hello Chaps and Chapesses,
>> >
>> > Somehow I've come this far and not done it but I was playing around
>>with
>> > the crawler for my ApacheCon demo and came across the
>> > TikaCmdLineMetExtractor that Rishi I believe wrote a while ago.
>> > So I've put some stuff in a folder and can crawl and ingest it using
>>the
>> > GenericFile element map, now in the past to map metadata I've written
>> some
>> > class to pump the data around and add to that file,
>>
>>
>> To what file ?
>>
>>
>> > but I was wondering if, as I know what fields are coming out of Tika
>>to
>> > just put them into the XML mapping file somehow so I can by pass
>>having
>> to
>> > write Java code?
>>
>>
>> Well Tika will make best effort to pull out as much metadata as
>>possible.
>> Chris explains a good bit about this here
>>
>>  https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>
>> I think that if custom extractions are required... You could most likely
>> extend the extractor interface and implement it but... This is Java code
>> which I assume you are trying to work around?
>>
>>
>> > This may be very obvious in which case I apologise but I can't find
>>owt
>> on
>> > the wiki so I figured I'd ask the gurus.
>> >
>> >
>>
>>
>>
>> --
>> *Lewis*
>>