Yeah, check the metadata. Any weird UTF-8 encoding? (I.e., run Tika on the file outside of OODT: what do you see?)
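One rough way to do that check, as a sketch only: dump the metadata with the Tika app (something like `java -jar tika-app.jar --json photo.jpg > photo.json`, where the jar and file names are placeholders) and scan the values for characters that XML 1.0 cannot carry. The guess that "weird encoding" means XML-illegal characters is an assumption, not something confirmed in this thread, and the function names below are made up for illustration.

```python
import json


def is_xml10_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_suspect_values(metadata: dict) -> dict:
    """Map each metadata key to the code points in its value(s) that XML 1.0 forbids.

    Values may be a single item or a list of items (multi-valued keys).
    Keys whose values are all clean are left out of the result.
    """
    suspects = {}
    for key, value in metadata.items():
        values = value if isinstance(value, list) else [value]
        bad = sorted(
            {
                f"U+{ord(ch):04X}"
                for v in values
                for ch in str(v)
                if not is_xml10_char(ch)
            }
        )
        if bad:
            suspects[key] = bad
    return suspects


def check_tika_json(text: str) -> dict:
    """Parse a JSON metadata dump (hypothetically from the Tika app) and scan it."""
    return find_suspect_values(json.loads(text))
```

Calling `check_tika_json(open("photo.json", encoding="utf-8").read())` returns an empty dict when nothing suspicious is found. If it does list keys for the picture or the mp3 but not for the txt file, that would at least point at the extracted metadata rather than the data itself.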
--
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Tom Barber <tom.bar...@meteorite.bi>
Reply-To: <dev@oodt.apache.org>
Date: Monday, November 23, 2015 at 7:23 AM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Crawling / Archiving binary data with Solr backend

>./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>--operation --launchMetCrawler --clientTransferer
>org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>--productPath $OODT_HOME/data/staging --metExtractor
>org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>--metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>
>I'm running that. Which runs fine with the default lucene stuff, also runs
>fine with a txt file, but doesn't run fine over a random picture I took or
>over an mp3 I tested it on.
>
>
>On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Encoding issues with the extracted metadata? What are you getting
>> just running Tika on the files?
>>
>> The actual data shouldn’t matter since it’s not being ingested
>> (are you doing it in place, or what data transferer are you using)?
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>> -----Original Message-----
>> From: Tom Barber <tom.bar...@meteorite.bi>
>> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Date: Monday, November 23, 2015 at 6:36 AM
>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Crawling / Archiving binary data with Solr backend
>>
>> >Hello,
>> >
>> >Looks like I've never tried it before with binary data. If I swap the
>> >filemgr defaults to use solr then try and crawl my staging directory using
>> >the Tika extractor I get a lot of
>> >
>> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] :
>> >null
>> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>> >
>> >
>> >Type things.
>> >
>> >Any ideas?
>> >
>> >Tom
>>