Good question. I think Rishi wrote that extractor, so you may want to ask him or just check the code. Field exclusion would be a welcome improvement if it’s not there already.
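For what it's worth, here is a rough sketch of how such an exclude list could work. It assumes the extracted metadata is available as a plain map of field name to values, and that the excluded fields are given as a comma-separated list. None of this is taken from the extractor's code: the class, the method names and the list format are made up for illustration, and the property key is only a suggested name, not something I have checked exists.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OODT: drops unwanted fields from extracted metadata.
final class FieldExcludeFilter {

    // Suggested (unverified) property key for the exclude list.
    static final String EXCLUDE_KEY =
            "org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList";

    // Reads a comma-separated list of field names, ignoring blanks and padding.
    static Set<String> parseExcludeList(Properties conf) {
        String raw = conf.getProperty(EXCLUDE_KEY, "");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
    }

    // Returns a copy of the metadata without the excluded fields, preserving order.
    static Map<String, List<String>> filter(Map<String, List<String>> met,
                                            Set<String> excluded) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : met.entrySet()) {
            if (!excluded.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

With something like this, a line in tika.conf listing the offending field (for example whatever name Tika gives the user comment field) would keep the binary junk out of the metadata before it reaches the filemgr.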
org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList

-C

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Tom Barber <tom.bar...@meteorite.bi>
Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Date: Monday, November 23, 2015 at 4:08 PM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Crawling / Archiving binary data with Solr backend

>okay then so it seems my phone writes some binary junk to a user comment
>field. I don't really plan to use phone images, but what would be good
>using the tika met extractor is to block certain fields in my tika.conf is
>that possible?
>
>On Mon, Nov 23, 2015 at 7:29 PM, Tom Barber <tom.bar...@meteorite.bi>
>wrote:
>
>> Ah ha. Think i've figured it out. The image has binary data in it, because
>> that fails with the filemgr, so thats one failure. The mp3 failed because
>> there was a space in the filename, but it appears the crawler can't cope
>> with such trickery!
>>
>> On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.bar...@meteorite.bi>
>> wrote:
>>
>>> filed jira, i'll finish my UI and workflow off for wednesday then circle
>>> back to it when I have 10 minutes to debug and see if its a quick
>>> fix/config issue. Looks like its failing to decode binary data though to me.
>>>
>>> Tom
>>>
>>> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.bar...@meteorite.bi>
>>> wrote:
>>>
>>>> Booooo
>>>>
>>>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <
>>>> chris.mattm...@gmail.com> wrote:
>>>>
>>>>> yep, agreed.
>>>>>
>>>>> —
>>>>> Chris Mattmann
>>>>> chris.mattm...@gmail.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> Reply-To: <dev@oodt.apache.org>
>>>>> Date: Monday, November 23, 2015 at 9:06 AM
>>>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>>
>>>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>>>> >fine, so its something either broken or i'm doing wrong in the crawler it
>>>>> >appears.
>>>>> >
>>>>> >Tom
>>>>> >
>>>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi>
>>>>> >wrote:
>>>>> >
>>>>> >> I'll give it a go. Thanks.
>>>>> >>
>>>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattm...@gmail.com>
>>>>> >> wrote:
>>>>> >>
>>>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>>>> >>> using TikaCmdLine extractor and then use that metadata file
>>>>> >>> to ingest into File Manager by hand? Does that work?
>>>>> >>>
>>>>> >>> —
>>>>> >>> Chris Mattmann
>>>>> >>> chris.mattm...@gmail.com
>>>>> >>>
>>>>> >>> -----Original Message-----
>>>>> >>> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> Reply-To: <dev@oodt.apache.org>
>>>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>> >>>
>>>>> >>> >Author: Alun Davis - Loudmouth
>>>>> >>> >Content-Length: 3273160
>>>>> >>> >Content-Type: audio/mpeg
>>>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>>>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>>>> >>> >channels: 2
>>>>> >>> >creator: Alun Davis - Loudmouth
>>>>> >>> >dc:creator: Alun Davis - Loudmouth
>>>>> >>> >dc:title: Teenage Baghead
>>>>> >>> >meta:author: Alun Davis - Loudmouth
>>>>> >>> >resourceName: Teenage Baghead.mp3
>>>>> >>> >samplerate: 44100
>>>>> >>> >title: Teenage Baghead
>>>>> >>> >version: MPEG 3 Layer III Version 1
>>>>> >>> >xmpDM:album:
>>>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>>>> >>> >xmpDM:audioChannelType: Stereo
>>>>> >>> >xmpDM:audioCompressor: MP3
>>>>> >>> >xmpDM:audioSampleRate: 44100
>>>>> >>> >xmpDM:duration: 204577.046875
>>>>> >>> >xmpDM:genre: Pop
>>>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>>>> >>> >xmpDM:releaseDate: 2001
>>>>> >>> >
>>>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>>>> >>> >
>>>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattm...@gmail.com>
>>>>> >>> >wrote:
>>>>> >>> >
>>>>> >>> >> yeah check the metadata. Any weird UTF-8 encoding?
>>>>> >>> >>
>>>>> >>> >> (aka run tika on the file outside of OODT what do you see?)
>>>>> >>> >>
>>>>> >>> >> —
>>>>> >>> >> Chris Mattmann
>>>>> >>> >> chris.mattm...@gmail.com
>>>>> >>> >>
>>>>> >>> >> -----Original Message-----
>>>>> >>> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>>> >>> >>
>>>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>>>> >>> >> >--operation --launchMetCrawler --clientTransferer
>>>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>>>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>>>> >>> >> >--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>>>> >>> >> >
>>>>> >>> >> >I'm running that. Which runs fine with the default lucene stuff, also runs
>>>>> >>> >> >fine with a txt file, but doesn't run fine over a random picture I took or
>>>>> >>> >> >over an mp3 I tested it on.
>>>>> >>> >> >
>>>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
>>>>> >>> >> >chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>> >>> >> >
>>>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>>>> >>> >> >> just running Tika on the files?
>>>>> >>> >> >>
>>>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
>>>>> >>> >> >>
>>>>> >>> >> >> Cheers,
>>>>> >>> >> >> Chris
>>>>> >>> >> >>
>>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> >>> >> >> Chris Mattmann, Ph.D.
>>>>> >>> >> >> Chief Architect
>>>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>>>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> >>> >> >> Office: 168-519, Mailstop: 168-527
>>>>> >>> >> >> Email: chris.a.mattm...@nasa.gov
>>>>> >>> >> >> WWW: http://sunset.usc.edu/~mattmann/
>>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>>>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> >>> >> >>
>>>>> >>> >> >> -----Original Message-----
>>>>> >>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>>>> >>> >> >>
>>>>> >>> >> >> >Hello,
>>>>> >>> >> >> >
>>>>> >>> >> >> >Looks like I've never tried it before with binary data.
>>>>> >>> >> >> >If I swap the
>>>>> >>> >> >> >filemgr defaults to use solr then try and crawl my staging directory using
>>>>> >>> >> >> >the Tika extractor I get a lot of
>>>>> >>> >> >> >
>>>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
>>>>> >>> >> >> >null
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>>>> >>> >> >> >
>>>>> >>> >> >> >Type things.
>>>>> >>> >> >> >
>>>>> >>> >> >> >Any ideas?
>>>>> >>> >> >> >
>>>>> >>> >> >> >Tom