Ah ha. Think I've figured it out. The image has binary data in it (it fails with the filemgr as well), so that's one failure. The mp3 failed because there was a space in the filename, and it appears the crawler can't cope with such trickery!
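As a stopgap for the filename problem, something like the following could tidy the staging directory before the crawler runs. This is only a sketch under my own assumptions: it assumes whitespace in the filename really is the trigger, and the helper name `rename_files_with_spaces` is made up for illustration, not part of OODT. It also changes the name the product is ingested under, so it is a workaround rather than a fix for the crawler.

```python
import tempfile
from pathlib import Path


def rename_files_with_spaces(staging_dir):
    """Rename files whose names contain whitespace, joining the parts with "_".

    Hypothetical helper (not an OODT API). Returns a list of
    (old_name, new_name) pairs for the files that were renamed.
    """
    renamed = []
    for path in sorted(Path(staging_dir).iterdir()):
        if not path.is_file():
            continue
        # Collapse every run of whitespace into a single underscore.
        new_name = "_".join(path.name.split())
        if not new_name or new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Never overwrite an existing file; leave this one for manual review.
            continue
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real staging area.
    with tempfile.TemporaryDirectory() as staging:
        Path(staging, "Teenage Baghead.mp3").write_bytes(b"")
        for old, new in rename_files_with_spaces(staging):
            print(f"{old} -> {new}")
```

Skipping files whose cleaned-up name already exists is deliberate: silently clobbering a product in staging seems worse than leaving one file for the crawler to trip over.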

On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:

> filed jira, i'll finish my UI and workflow off for wednesday then circle
> back to it when I have 10 minutes to debug and see if its a quick
> fix/config issue. Looks like its failing to decode binary data though to me.
>
> Tom
>
> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
>
>> Booooo
>>
>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
>>
>>> yep, agreed.
>>>
>>> --
>>> Chris Mattmann
>>> chris.mattm...@gmail.com
>>>
>>> -----Original Message-----
>>> From: Tom Barber <tom.bar...@meteorite.bi>
>>> Reply-To: <dev@oodt.apache.org>
>>> Date: Monday, November 23, 2015 at 9:06 AM
>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>
>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>> >fine, so its something either broken or i'm doing wrong in the crawler it
>>> >appears.
>>> >
>>> >Tom
>>> >
>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:
>>> >
>>> >> I'll give it a go. Thanks.
>>> >>
>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
>>> >>
>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>> >>> using TikaCmdLine extractor and then use that metadata file
>>> >>> to ingest into File Manager by hand? Does that work?
>>> >>>
>>> >>> --
>>> >>> Chris Mattmann
>>> >>> chris.mattm...@gmail.com
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> Reply-To: <dev@oodt.apache.org>
>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>>
>>> >>> >Author: Alun Davis - Loudmouth
>>> >>> >Content-Length: 3273160
>>> >>> >Content-Type: audio/mpeg
>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>> >>> >channels: 2
>>> >>> >creator: Alun Davis - Loudmouth
>>> >>> >dc:creator: Alun Davis - Loudmouth
>>> >>> >dc:title: Teenage Baghead
>>> >>> >meta:author: Alun Davis - Loudmouth
>>> >>> >resourceName: Teenage Baghead.mp3
>>> >>> >samplerate: 44100
>>> >>> >title: Teenage Baghead
>>> >>> >version: MPEG 3 Layer III Version 1
>>> >>> >xmpDM:album:
>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>> >>> >xmpDM:audioChannelType: Stereo
>>> >>> >xmpDM:audioCompressor: MP3
>>> >>> >xmpDM:audioSampleRate: 44100
>>> >>> >xmpDM:duration: 204577.046875
>>> >>> >xmpDM:genre: Pop
>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>> >>> >xmpDM:releaseDate: 2001
>>> >>> >
>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>> >>> >
>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann <chris.mattm...@gmail.com> wrote:
>>> >>> >
>>> >>> >> yeah check the metadata. Any weird UTF-8 encoding?
>>> >>> >>
>>> >>> >> (aka run tika on the file outside of OODT what do you see?)
>>> >>> >>
>>> >>> >> --
>>> >>> >> Chris Mattmann
>>> >>> >> chris.mattm...@gmail.com
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>> >>
>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>> >>> >> >--operation --launchMetCrawler --clientTransferer
>>> >>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>> >>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>>> >>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>> >>> >> >--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>> >>> >> >
>>> >>> >> >I'm running that. Which runs fine with the default lucene stuff, also runs
>>> >>> >> >fine with a txt file, but doesn't run fine over a random picture I took or
>>> >>> >> >over an mp3 I tested it on.
>>> >>> >> >
>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >>> >> >
>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>> >>> >> >> just running Tika on the files?
>>> >>> >> >>
>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
>>> >>> >> >>
>>> >>> >> >> Cheers,
>>> >>> >> >> Chris
>>> >>> >> >>
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Chris Mattmann, Ph.D.
>>> >>> >> >> Chief Architect
>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >>> >> >> Office: 168-519, Mailstop: 168-527
>>> >>> >> >> Email: chris.a.mattm...@nasa.gov
>>> >>> >> >> WWW: http://sunset.usc.edu/~mattmann/
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >>
>>> >>> >> >> -----Original Message-----
>>> >>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>> >>> >> >>
>>> >>> >> >> >Hello,
>>> >>> >> >> >
>>> >>> >> >> >Looks like I've never tried it before with binary data.
>>> >>> >> >> >If I swap the filemgr defaults to use solr then try and crawl my staging
>>> >>> >> >> >directory using the Tika extractor I get a lot of
>>> >>> >> >> >
>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
>>> >>> >> >> >null
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>> >>> >> >> >
>>> >>> >> >> >Type things.
>>> >>> >> >> >
>>> >>> >> >> >Any ideas?
>>> >>> >> >> >
>>> >>> >> >> >Tom