Good question. I think Rishi wrote that extractor, so you may
want to ask him or just check the code. It would be a welcome
improvement if it’s not there.

org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList
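
For illustration only, here is a rough sketch of what an exclude list keyed on that property name might do. Everything in it is an assumption rather than existing OODT behaviour: the comma-separated value format, the class and method names (FieldExcludeSketch, excludedFields, dropExcluded), and the "User Comment" field name are all hypothetical, and the metadata is modelled as a plain Map instead of OODT's own Metadata class.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

public class FieldExcludeSketch {
    // Property name taken from this message; treating its value as a
    // comma-separated list of field names is an assumption.
    public static final String EXCLUDE_KEY =
        "org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList";

    // Parse the (assumed) comma-separated list, ignoring blanks and padding.
    public static Set<String> excludedFields(Properties conf) {
        String raw = conf.getProperty(EXCLUDE_KEY, "");
        return Arrays.stream(raw.split(","))
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toSet());
    }

    // Return a copy of the extracted metadata without the excluded keys.
    public static Map<String, String> dropExcluded(Map<String, String> met,
                                                   Set<String> excluded) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : met.entrySet()) {
            if (!excluded.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical field names, for demonstration only.
        conf.setProperty(EXCLUDE_KEY, "User Comment, X-Parsed-By");

        Map<String, String> met = new LinkedHashMap<>();
        met.put("Content-Type", "image/jpeg");
        met.put("User Comment", "\u0000\u0001binary junk");
        met.put("X-Parsed-By", "org.apache.tika.parser.DefaultParser");

        // Only the fields not named in the exclude list remain.
        System.out.println(dropExcluded(met, excludedFields(conf)).keySet());
    }
}
```

The filter runs after extraction and before ingest, so a field carrying binary junk would never reach the File Manager catalog.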

-C

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Tom Barber <tom.bar...@meteorite.bi>
Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Date: Monday, November 23, 2015 at 4:08 PM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Crawling / Archiving binary data with Solr backend

>Okay, so it seems my phone writes some binary junk to a user comment
>field. I don't really plan to use phone images, but it would be good to
>be able to block certain fields in my tika.conf when using the Tika met
>extractor. Is that possible?
>
>On Mon, Nov 23, 2015 at 7:29 PM, Tom Barber <tom.bar...@meteorite.bi>
>wrote:
>
>> Ah ha. Think I've figured it out. The image has binary data in it, because
>> that fails with the filemgr, so that's one failure. The mp3 failed because
>> there was a space in the filename, but it appears the crawler can't cope
>> with such trickery!
>>
>> On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.bar...@meteorite.bi>
>> wrote:
>>
>>> Filed a JIRA. I'll finish my UI and workflow off for Wednesday, then circle
>>> back to it when I have 10 minutes to debug and see if it's a quick
>>> fix/config issue. Looks to me like it's failing to decode binary data, though.
>>>
>>> Tom
>>>
>>> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.bar...@meteorite.bi>
>>> wrote:
>>>
>>>>  Booooo
>>>>
>>>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <
>>>> chris.mattm...@gmail.com> wrote:
>>>>
>>>>> yep, agreed.
>>>>>
>>>>> —
>>>>> Chris Mattmann
>>>>> chris.mattm...@gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> Reply-To: <dev@oodt.apache.org>
>>>>> Date: Monday, November 23, 2015 at 9:06 AM
>>>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>>
>>>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>>>> >fine, so it appears something is either broken or I'm doing something
>>>>> >wrong in the crawler.
>>>>> >
>>>>> >Tom
>>>>> >
>>>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi>
>>>>> >wrote:
>>>>> >
>>>>> >> I'll give it a go. Thanks.
>>>>> >>
>>>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattm...@gmail.com>
>>>>> >> wrote:
>>>>> >>
>>>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>>>> >>> using TikaCmdLine extractor and then use that metadata file
>>>>> >>> to ingest into File Manager by hand? Does that work?
>>>>> >>>
>>>>> >>> —
>>>>> >>> Chris Mattmann
>>>>> >>> chris.mattm...@gmail.com
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> -----Original Message-----
>>>>> >>> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> Reply-To: <dev@oodt.apache.org>
>>>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>> >>>
>>>>> >>> >Author: Alun Davis - Loudmouth
>>>>> >>> >Content-Length: 3273160
>>>>> >>> >Content-Type: audio/mpeg
>>>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>>>> >>> >X-TIKA:digest:SHA256:
>>>>> >>> >34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>>>> >>> >channels: 2
>>>>> >>> >creator: Alun Davis - Loudmouth
>>>>> >>> >dc:creator: Alun Davis - Loudmouth
>>>>> >>> >dc:title: Teenage Baghead
>>>>> >>> >meta:author: Alun Davis - Loudmouth
>>>>> >>> >resourceName: Teenage Baghead.mp3
>>>>> >>> >samplerate: 44100
>>>>> >>> >title: Teenage Baghead
>>>>> >>> >version: MPEG 3 Layer III Version 1
>>>>> >>> >xmpDM:album:
>>>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>>>> >>> >xmpDM:audioChannelType: Stereo
>>>>> >>> >xmpDM:audioCompressor: MP3
>>>>> >>> >xmpDM:audioSampleRate: 44100
>>>>> >>> >xmpDM:duration: 204577.046875
>>>>> >>> >xmpDM:genre: Pop
>>>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>>>> >>> >xmpDM:releaseDate: 2001
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>>>> >>> >
>>>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattm...@gmail.com>
>>>>> >>> >wrote:
>>>>> >>> >
>>>>> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
>>>>> >>> >>
>>>>> >>> >> (i.e., run Tika on the file outside of OODT; what do you see?)
>>>>> >>> >>
>>>>> >>> >> —
>>>>> >>> >> Chris Mattmann
>>>>> >>> >> chris.mattm...@gmail.com
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >> -----Original Message-----
>>>>> >>> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>> >>> >>
>>>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>>>> >>> >> >--operation --launchMetCrawler --clientTransferer
>>>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>>>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>>>> >>> >> >--metExtractorConfig
>>>>> >>> >> >/home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>>>> >>> >> >
>>>>> >>> >> >I'm running that. Which runs fine with the default lucene stuff, also runs
>>>>> >>> >> >fine with a txt file, but doesn't run fine over a random picture I took or
>>>>> >>> >> >over an mp3 I tested it on.
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
>>>>> >>> >> >chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>> >>> >> >
>>>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>>>> >>> >> >> just running Tika on the files?
>>>>> >>> >> >>
>>>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
>>>>> >>> >> >>
>>>>> >>> >> >> Cheers,
>>>>> >>> >> >> Chris
>>>>> >>> >> >>
>>>>> >>> >> >>
>>>>> >>> >> >> -----Original Message-----
>>>>> >>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>>>> >>> >> >>
>>>>> >>> >> >> >Hello,
>>>>> >>> >> >> >
>>>>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap the
>>>>> >>> >> >> >filemgr defaults to use solr then try and crawl my staging directory using
>>>>> >>> >> >> >the Tika extractor I get a lot of
>>>>> >>> >> >> >
>>>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
>>>>> >>> >> >> >null
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> >Type things.
>>>>> >>> >> >> >
>>>>> >>> >> >> >Any ideas?
>>>>> >>> >> >> >
>>>>> >>> >> >> >Tom
