It seems to me (without looking at the source for Chris's examples) that either 
it's more complex than I imagined or I'm just bad at explaining stuff.

My understanding is that, when using the crawler, the TikaCmdLineMetExtractor 
creates a met file on the fly?

Within a met file is the metadata associated with a product you are about to 
ingest.
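
For example, a hand-written sketch of the CAS met file format (the keys here 
are purely illustrative, not necessarily what Tika emits):

    <cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
        <keyval>
            <key>ProductType</key>
            <val>GenericFile</val>
        </keyval>
        <keyval>
            <key>SomejpegData</key>
            <val>some-value-tika-found</val>
        </keyval>
    </cas:metadata>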

Those met files map to a product-type-element map file in the filemgr policy 
area. Tika already extracts lots of metadata, so does that get put in the .met 
file where I can map it directly in the product-type-element map file:

    <type id="urn:oodt:ImageFile">
        <element id="urn:oodt:ProductReceivedTime"/>
        <element id="urn:oodt:ProductName"/>
        <element id="urn:oodt:ProductId"/>
        <element id="urn:oodt:ProductType"/>
        <element id="urn:oodt:ProductStructure"/>
        <element id="urn:oodt:Filename"/>
        <element id="urn:oodt:FileLocation"/>
        <element id="urn:oodt:MimeType"/>
        <element id="urn:test:DataVersion"/>
        <element id="urn:tika:SomejpegData"/>
    </type>
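
(As I understand it, each element id referenced there also has to be declared 
in elements.xml in the same policy directory, roughly like this; the 
description text is just mine:

    <element id="urn:tika:SomejpegData" name="SomejpegData">
        <dcElement/>
        <description>Hypothetical EXIF field pulled out by Tika.</description>
    </element>

with entries like that sitting inside the <cas:elements> root of 
elements.xml.)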

I would have thought that would make ingesting extended metadata without 
having to write code far easier, but I couldn't find an example.

Clearly by now I could have debugged the source code :) so I guess I'll do that 
this evening and see who is correct (or how bad I am at explaining stuff).


Tom


On Sat, Apr 04, 2015 at 05:16:53AM +0000, Mattmann, Chris A (3980) wrote:
The suggestion I have would be to whip up a quick implementation
of a LenientValidationLayer that takes in a Catalog implementation.
If it’s the DataSource/MappedDataSource/ScienceData catalog, you:

1. iterate over all product types and then get 1 hit from each,
getting their metadata, and using that to “infer” what the elements
are. I would do this statically 1x for each product type and update
it based on a cache timeout (every 5 mins, or so)

If it’s the LuceneCatalog / SolrCatalog, yay, it’s Lucene, and you should be
able to ask it for the TermVocabulary and/or all the fields present
in the index. Single call. Easy.
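
For Solr that would be something like the Luke request handler (the core name
here is made up):

    curl 'http://localhost:8983/solr/oodt-fm/admin/luke?numTerms=0'

which lists every field present in the index in one call.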

Another way to do it would be to build a Lucene/Solr, and a
DataSource/Mapped/ScienceData Lenient Val Layer that simply takes a ref to the
Catalog and/or Database, ignores having to go through the Catalog interface,
and then simply gets the info you need (and lets all fields through and
returns them the same).
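
The inference in step 1 would look something like this (totally untested
sketch; the Catalog/Metadata method names are from memory, so check them
against trunk before trusting any of it):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.oodt.cas.filemgr.catalog.Catalog;
    import org.apache.oodt.cas.filemgr.structs.Element;
    import org.apache.oodt.cas.filemgr.structs.Product;
    import org.apache.oodt.cas.filemgr.structs.ProductType;
    import org.apache.oodt.cas.metadata.Metadata;

    /**
     * Sketch for a lenient validation layer: infer the elements for a
     * product type by pulling 1 hit from the Catalog and treating its
     * metadata keys as the element list. A real version would implement
     * ValidationLayer and cache the result per type (~5 min timeout).
     */
    public class ElementInferrer {

        private final Catalog catalog;

        public ElementInferrer(Catalog catalog) {
            this.catalog = catalog;
        }

        public List<Element> inferElements(ProductType type) throws Exception {
            List<Element> elements = new ArrayList<Element>();
            // 1 hit from this product type is enough to see its metadata keys
            List<Product> hits = catalog.getTopNProducts(1, type);
            if (hits == null || hits.isEmpty()) {
                return elements;
            }
            Metadata met = catalog.getMetadata(hits.get(0));
            for (String key : met.getAllKeys()) {
                Element element = new Element();
                element.setElementId("urn:oodt:" + key); // made-up URN scheme
                element.setElementName(key);
                elements.add(element);
            }
            return elements;
        }
    }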

HTH,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Tom Barber <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015 at 10:31 AM
To: "[email protected]" <[email protected]>
Subject: Re: Tika Based Metadata Extraction

Sorry, the product element mapping file in my filemgr policy; by default you
have the GenericFile policy. So if I run tika-app over a jpeg file for
example I can see all the exif data etc. in fields. Can I just map that to a
product type without writing code?
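
e.g. something like this (the tika-app version number is a guess) prints
every key: value pair Tika finds, and it's those keys I'd like to map
straight into the policy:

    java -jar tika-app-1.7.jar --metadata mypic.jpg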

Tom
On 3 Apr 2015 18:02, "Lewis John Mcgibbney" <[email protected]> wrote:

Hi Tom,

On Friday, April 3, 2015, Tom Barber <[email protected]> wrote:

> Hello Chaps and Chapesses,
>
> Somehow I've come this far and not done it but I was playing around with
> the crawler for my ApacheCon demo and came across the
> TikaCmdLineMetExtractor that Rishi I believe wrote a while ago.
> So I've put some stuff in a folder and can crawl and ingest it using the
> GenericFile element map, now in the past to map metadata I've written some
> class to pump the data around and add to that file,


To what file ?


> but I was wondering if, as I know what fields are coming out of Tika, to
> just put them into the XML mapping file somehow so I can bypass having to
> write Java code?


Well Tika will make a best effort to pull out as much metadata as possible.
Chris explains a good bit about this here

 https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help

I think that if custom extractions are required... you could most likely
extend the extractor interface and implement it, but... this is Java code,
which I assume you are trying to work around?
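
For illustration, the guts of such an extractor come down to something like
this (an untested sketch against the plain Tika API; the class name and the
main-method shape are mine, not how OODT actually wires extractors):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // Untested sketch: run Tika over a file and print each discovered
    // metadata field. A real OODT extractor would copy these keys into
    // a CAS Metadata object / .met file instead of printing them.
    public class TikaFieldDump {
        public static void main(String[] args) throws Exception {
            Metadata met = new Metadata();
            try (InputStream in = new FileInputStream(new File(args[0]))) {
                new AutoDetectParser().parse(in, new BodyContentHandler(-1),
                        met, new ParseContext());
            }
            for (String name : met.names()) {
                System.out.println(name + ": " + met.get(name));
            }
        }
    }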


> This may be very obvious in which case I apologise but I can't find owt on
> the wiki so I figured I'd ask the gurus.



--
*Lewis*

