yep, agreed.

--
Chris Mattmann
chris.mattm...@gmail.com

-----Original Message-----
From: Tom Barber <tom.bar...@meteorite.bi>
Reply-To: <dev@oodt.apache.org>
Date: Monday, November 23, 2015 at 9:06 AM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: Crawling / Archiving binary data with Solr backend

>Dumping a .met file and calling the filemgr client ingest routine works
>fine, so it's something either broken or that I'm doing wrong in the
>crawler, it appears.
>
>Tom
>
>On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi>
>wrote:
>
>> I'll give it a go. Thanks.
>>
>> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann
>> <chris.mattm...@gmail.com> wrote:
>>
>>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>> using TikaCmdLine extractor and then use that metadata file
>>> to ingest into File Manager by hand? Does that work?
>>>
>>> --
>>> Chris Mattmann
>>> chris.mattm...@gmail.com
>>>
>>> -----Original Message-----
>>> From: Tom Barber <tom.bar...@meteorite.bi>
>>> Reply-To: <dev@oodt.apache.org>
>>> Date: Monday, November 23, 2015 at 7:43 AM
>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>
>>> >Author: Alun Davis - Loudmouth
>>> >Content-Length: 3273160
>>> >Content-Type: audio/mpeg
>>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>> >channels: 2
>>> >creator: Alun Davis - Loudmouth
>>> >dc:creator: Alun Davis - Loudmouth
>>> >dc:title: Teenage Baghead
>>> >meta:author: Alun Davis - Loudmouth
>>> >resourceName: Teenage Baghead.mp3
>>> >samplerate: 44100
>>> >title: Teenage Baghead
>>> >version: MPEG 3 Layer III Version 1
>>> >xmpDM:album:
>>> >xmpDM:artist: Alun Davis - Loudmouth
>>> >xmpDM:audioChannelType: Stereo
>>> >xmpDM:audioCompressor: MP3
>>> >xmpDM:audioSampleRate: 44100
>>> >xmpDM:duration: 204577.046875
>>> >xmpDM:genre: Pop
>>> >xmpDM:logComment: www.maimthattune.com for more!
>>> >xmpDM:releaseDate: 2001
>>> >
>>> >Nothing that should scare a parser in the mp3 at least.
>>> >
>>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann
>>> ><chris.mattm...@gmail.com> wrote:
>>> >
>>> >> yeah check the metadata. Any weird UTF-8 encoding?
>>> >>
>>> >> (aka run Tika on the file outside of OODT: what do you see?)
>>> >>
>>> >> --
>>> >> Chris Mattmann
>>> >> chris.mattm...@gmail.com
>>> >>
>>> >> -----Original Message-----
>>> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >> Reply-To: <dev@oodt.apache.org>
>>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>
>>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000
>>> >> >--operation --launchMetCrawler --clientTransferer
>>> >> >org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>>> >> >--productPath $OODT_HOME/data/staging --metExtractor
>>> >> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
>>> >> >--metExtractorConfig
>>> >> >/home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>> >> >
>>> >> >I'm running that, which runs fine with the default Lucene stuff and
>>> >> >also runs fine with a txt file, but doesn't run fine over a random
>>> >> >picture I took or over an mp3 I tested it on.
>>> >> >
>>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980)
>>> >> ><chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >> >
>>> >> >> Encoding issues with the extracted metadata? What are you getting
>>> >> >> just running Tika on the files?
>>> >> >>
>>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>> >> >> (are you doing it in place, or what data transferer are you using)?
>>> >> >>
>>> >> >> Cheers,
>>> >> >> Chris
>>> >> >>
>>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >> >> Chris Mattmann, Ph.D.
>>> >> >> Chief Architect
>>> >> >> Instrument Software and Science Data Systems Section (398)
>>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >> >> Office: 168-519, Mailstop: 168-527
>>> >> >> Email: chris.a.mattm...@nasa.gov
>>> >> >> WWW: http://sunset.usc.edu/~mattmann/
>>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >> >> Adjunct Associate Professor, Computer Science Department
>>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >> >>
>>> >> >> -----Original Message-----
>>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>> >> >>
>>> >> >> >Hello,
>>> >> >> >
>>> >> >> >Looks like I've never tried it before with binary data. If I swap
>>> >> >> >the filemgr defaults to use Solr and then try to crawl my staging
>>> >> >> >directory using the Tika extractor, I get a lot of
>>> >> >> >
>>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException:
>>> >> >> >Error ingesting product
>>> >> >> >[org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>> >> >> >
>>> >> >> >Type things.
>>> >> >> >
>>> >> >> >Any ideas?
>>> >> >> >
>>> >> >> >Tom
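
A minimal sketch of the metadata check suggested in the thread ("Any weird UTF-8 encoding?"), assuming the Tika output has been saved as plain `key: value` lines like those quoted above. The function names `parse_tika_metadata` and `find_suspicious` are hypothetical and are not part of OODT or Tika. The sketch also flags empty values, because `xmpDM:album` is empty in the quoted output; the thread does not establish whether that has anything to do with the CatalogException.

```python
import unicodedata


def parse_tika_metadata(text):
    """Parse 'key: value' lines into a dict.

    Keys may themselves contain colons (e.g. 'X-TIKA:digest:MD5'), so the
    split happens on the first colon followed by a space. A line that ends
    in a bare colon is treated as a key with an empty value.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value.strip()
        elif line.endswith(":"):
            fields[line[:-1]] = ""
    return fields


def find_suspicious(fields):
    """Return {key: [reasons]} for values that may deserve a closer look."""
    report = {}
    for key, value in fields.items():
        reasons = []
        if value == "":
            reasons.append("empty value")
        if any(ord(ch) > 127 for ch in value):
            reasons.append("non-ASCII characters")
        if any(unicodedata.category(ch) == "Cc" for ch in value):
            reasons.append("control characters")
        if reasons:
            report[key] = reasons
    return report


# A few fields copied from the Tika output quoted in the thread.
SAMPLE = """\
Content-Type: audio/mpeg
X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
dc:title: Teenage Baghead
xmpDM:album:
xmpDM:genre: Pop
"""

print(find_suspicious(parse_tika_metadata(SAMPLE)))
# prints {'xmpDM:album': ['empty value']}
```

On the sample fields only the empty `xmpDM:album` value is reported, which matches the observation in the thread that nothing in the mp3 metadata looks like an encoding problem.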