Yeah, I did have a look but didn't see anything; I was just checking there wasn't any crawler-wide setting I was missing. I'll file it and do it later, as it would be beneficial.
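
For the field-blocking idea, here is a minimal sketch of what an exclude list could do. It is plain Python, not the OODT or Tika API: the function name `drop_excluded_fields` and the sample field names are made up for illustration, and whether the Tika met extractor actually reads a `fieldExcludeList` property from tika.conf was not confirmed in this thread.

```python
import fnmatch


def drop_excluded_fields(metadata, exclude_patterns):
    """Return a copy of metadata without keys matching any exclude pattern.

    Patterns use shell-style wildcards (e.g. "xmpDM:*"), matched case-sensitively.
    """
    return {
        key: value
        for key, value in metadata.items()
        if not any(fnmatch.fnmatchcase(key, pattern) for pattern in exclude_patterns)
    }


# Hypothetical extractor output; the "User Comment" value stands in for the
# binary junk a phone camera might write.
extracted = {
    "Content-Type": "image/jpeg",
    "User Comment": "\x00\x01binary junk",
    "resourceName": "photo.jpg",
}

cleaned = drop_excluded_fields(extracted, ["User Comment"])
print(sorted(cleaned))
```

Filtering on the key before anything is sent to the filemgr would keep the offending value out of the ingest request entirely, which is simpler than trying to repair it.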
Tom

On Tue, Nov 24, 2015 at 3:48 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

> good question, I think Rishi wrote that extractor, so you may
> want to ask him or just check the code. Would be a welcome improvement
> if it’s not there.
>
> org.apache.oodt.cas.metadata.extractors.tika.fieldExcludeList
>
> -C
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Tom Barber <tom.bar...@meteorite.bi>
> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> Date: Monday, November 23, 2015 at 4:08 PM
> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> Subject: Re: Crawling / Archiving binary data with Solr backend
>
> >okay then so it seems my phone writes some binary junk to a user comment
> >field. I don't really plan to use phone images, but what would be good
> >using the tika met extractor is to block certain fields in my tika.conf is
> >that possible?
> >
> >On Mon, Nov 23, 2015 at 7:29 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
> >
> >> Ah ha. Think i've figured it out. The image has binary data in it, because
> >> that fails with the filemgr, so thats one failure. The mp3 failed because
> >> there was a space in the filename, but it appears the crawler can't cope
> >> with such trickery!
> >>
> >> On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
> >>
> >>> filed jira, i'll finish my UI and workflow off for wednesday then circle
> >>> back to it when I have 10 minutes to debug and see if its a quick
> >>> fix/config issue. Looks like its failing to decode binary data though to me.
> >>>
> >>> Tom
> >>>
> >>> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
> >>>
> >>>> Booooo
> >>>>
> >>>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
> >>>>
> >>>>> yep, agreed.
> >>>>>
> >>>>> --
> >>>>> Chris Mattmann
> >>>>> chris.mattm...@gmail.com
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Tom Barber <tom.bar...@meteorite.bi>
> >>>>> Reply-To: <dev@oodt.apache.org>
> >>>>> Date: Monday, November 23, 2015 at 9:06 AM
> >>>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>>
> >>>>> >Dumping a .met file and calling the filemgr client ingest routine works
> >>>>> >fine, so its something either broken or i'm doing wrong in the crawler it
> >>>>> >appears.
> >>>>> >
> >>>>> >Tom
> >>>>> >
> >>>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
> >>>>> >
> >>>>> >> I'll give it a go. Thanks.
> >>>>> >>
> >>>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
> >>>>> >>
> >>>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
> >>>>> >>> using TikaCmdLine extractor and then use that metadata file
> >>>>> >>> to ingest into File Manager by hand? Does that work?
> >>>>> >>>
> >>>>> >>> --
> >>>>> >>> Chris Mattmann
> >>>>> >>> chris.mattm...@gmail.com
> >>>>> >>>
> >>>>> >>> -----Original Message-----
> >>>>> >>> From: Tom Barber <tom.bar...@meteorite.bi>
> >>>>> >>> Reply-To: <dev@oodt.apache.org>
> >>>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
> >>>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>> >>>
> >>>>> >>> >Author: Alun Davis - Loudmouth
> >>>>> >>> >Content-Length: 3273160
> >>>>> >>> >Content-Type: audio/mpeg
> >>>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
> >>>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
> >>>>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
> >>>>> >>> >channels: 2
> >>>>> >>> >creator: Alun Davis - Loudmouth
> >>>>> >>> >dc:creator: Alun Davis - Loudmouth
> >>>>> >>> >dc:title: Teenage Baghead
> >>>>> >>> >meta:author: Alun Davis - Loudmouth
> >>>>> >>> >resourceName: Teenage Baghead.mp3
> >>>>> >>> >samplerate: 44100
> >>>>> >>> >title: Teenage Baghead
> >>>>> >>> >version: MPEG 3 Layer III Version 1
> >>>>> >>> >xmpDM:album:
> >>>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
> >>>>> >>> >xmpDM:audioChannelType: Stereo
> >>>>> >>> >xmpDM:audioCompressor: MP3
> >>>>> >>> >xmpDM:audioSampleRate: 44100
> >>>>> >>> >xmpDM:duration: 204577.046875
> >>>>> >>> >xmpDM:genre: Pop
> >>>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
> >>>>> >>> >xmpDM:releaseDate: 2001
> >>>>> >>> >
> >>>>> >>> >Nothing that should scare a parser in the mp3 at least.
> >>>>> >>> >
> >>>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
> >>>>> >>> >
> >>>>> >>> >> yeah check the metadata. Any weird UTF-8 encoding?
> >>>>> >>> >>
> >>>>> >>> >> (aka run tika on the file outside of OODT what do you see?)
> >>>>> >>> >>
> >>>>> >>> >> --
> >>>>> >>> >> Chris Mattmann
> >>>>> >>> >> chris.mattm...@gmail.com
> >>>>> >>> >>
> >>>>> >>> >> -----Original Message-----
> >>>>> >>> >> From: Tom Barber <tom.bar...@meteorite.bi>
> >>>>> >>> >> Reply-To: <dev@oodt.apache.org>
> >>>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
> >>>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
> >>>>> >>> >>
> >>>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
> >>>>> >>> >> >--operation --launchMetCrawler --clientTransferer
> >>>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
> >>>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
> >>>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
> >>>>> >>> >> >--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
> >>>>> >>> >> >
> >>>>> >>> >> >I'm running that. Which runs fine with the default lucene stuff, also runs
> >>>>> >>> >> >fine with a txt file, but doesn't run fine over a random picture I took or
> >>>>> >>> >> >over an mp3 I tested it on.
> >>>>> >>> >> >
> >>>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
> >>>>> >>> >> >
> >>>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
> >>>>> >>> >> >> just running Tika on the files?
> >>>>> >>> >> >>
> >>>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
> >>>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
> >>>>> >>> >> >>
> >>>>> >>> >> >> Cheers,
> >>>>> >>> >> >> Chris
> >>>>> >>> >> >>
> >>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>> >>> >> >> Chris Mattmann, Ph.D.
> >>>>> >>> >> >> Chief Architect
> >>>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
> >>>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>>> >>> >> >> Office: 168-519, Mailstop: 168-527
> >>>>> >>> >> >> Email: chris.a.mattm...@nasa.gov
> >>>>> >>> >> >> WWW: http://sunset.usc.edu/~mattmann/
> >>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
> >>>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
> >>>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>> >>> >> >>
> >>>>> >>> >> >> -----Original Message-----
> >>>>> >>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
> >>>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
> >>>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> >>>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
> >>>>> >>> >> >>
> >>>>> >>> >> >> >Hello,
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Looks like I've never tried it before with binary data.
> >>>>> >>> >> >> >If I swap the filemgr defaults to use solr then try and crawl my staging
> >>>>> >>> >> >> >directory using the Tika extractor I get a lot of
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
> >>>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
> >>>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
> >>>>> >>> >> >> >null
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
> >>>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Type things.
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Any ideas?
> >>>>> >>> >> >> >
> >>>>> >>> >> >> >Tom
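
One plausible reading of the trace above, given the later finding that a metadata field held binary junk, is that a value contained characters XML 1.0 cannot carry, so the XML-RPC round trip failed. That is an assumption, not something the thread confirmed. The sketch below is a standalone Python check for such values; `find_unsafe_fields` and `strip_unsafe_chars` are hypothetical helper names, and the pattern covers only the C0 control characters plus U+FFFE and U+FFFF, not every code point the XML spec excludes.

```python
import re

# Subset of characters not allowed in XML 1.0 documents: C0 controls other
# than tab, line feed and carriage return, plus U+FFFE and U+FFFF.
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_unsafe_fields(metadata):
    """Return the sorted names of fields whose value contains an invalid character."""
    return sorted(
        key for key, value in metadata.items() if _XML_INVALID.search(str(value))
    )


def strip_unsafe_chars(value):
    """Remove the invalid characters from a single string value."""
    return _XML_INVALID.sub("", value)


# Hypothetical metadata: the mp3 fields are taken from the Tika output quoted
# in the thread, the "User Comment" value is invented.
sample = {
    "dc:title": "Teenage Baghead",
    "xmpDM:genre": "Pop",
    "User Comment": "ASCII\x00\x00\x00\x07junk",
}

print(find_unsafe_fields(sample))
print(repr(strip_unsafe_chars(sample["User Comment"])))
```

Running a check like this over the extractor output before ingesting would at least show which field to exclude or clean.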