Aha, I think I've figured it out. The image has binary data in it, because
that fails with the filemgr, so that's one failure. The mp3 failed because
there was a space in the filename, and it appears the crawler can't cope
with such trickery!
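
As a stopgap for the filename problem, the staging directory can be checked for names containing whitespace before the crawler runs. This is a minimal sketch in Python, not anything that ships with OODT: `files_with_whitespace` and `safe_name` are hypothetical helper names, and the staging path is whatever is passed to `--productPath`.

```python
from pathlib import Path


def files_with_whitespace(staging_dir):
    """Return sorted paths under staging_dir whose file name contains whitespace."""
    root = Path(staging_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and any(ch.isspace() for ch in p.name)
    )


def safe_name(name):
    """Replace each run of whitespace in a file name with a single underscore."""
    return "_".join(name.split())
```

For example, `safe_name("Teenage Baghead.mp3")` gives `"Teenage_Baghead.mp3"`. Whether renaming is acceptable depends on the archive; the sketch only reports and suggests names, it does not rename anything.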

On Mon, Nov 23, 2015 at 7:24 PM, Tom Barber <tom.bar...@meteorite.bi> wrote:

> Filed a JIRA. I'll finish off my UI and workflow for Wednesday, then circle
> back to it when I have 10 minutes to debug and see if it's a quick
> fix/config issue. It looks to me like it's failing to decode binary data, though.
>
> Tom
>
> On Mon, Nov 23, 2015 at 7:18 PM, Tom Barber <tom.bar...@meteorite.bi>
> wrote:
>
>>  Booooo
>>
>> On Mon, Nov 23, 2015 at 5:09 PM, Chris Mattmann <chris.mattm...@gmail.com>
>> wrote:
>>
>>> yep, agreed.
>>>
>>> —
>>> Chris Mattmann
>>> chris.mattm...@gmail.com
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Tom Barber <tom.bar...@meteorite.bi>
>>> Reply-To: <dev@oodt.apache.org>
>>> Date: Monday, November 23, 2015 at 9:06 AM
>>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>>
>>> >Dumping a .met file and calling the filemgr client ingest routine works
>>> >fine, so it appears that either something is broken in the crawler or
>>> >I'm doing something wrong with it.
>>> >
>>> >Tom
>>> >
>>> >On Mon, Nov 23, 2015 at 3:45 PM, Tom Barber <tom.bar...@meteorite.bi>
>>> >wrote:
>>> >
>>> >> I'll give it a go. Thanks.
>>> >>
>>> >> On Mon, Nov 23, 2015 at 3:44 PM, Chris Mattmann
>>> >><chris.mattm...@gmail.com>
>>> >> wrote:
>>> >>
>>> >>> Doesn’t look weird. Hmm. Can you generate a metadata file
>>> >>> using TikaCmdLine extractor and then use that metadata file
>>> >>> to ingest into File Manager by hand? Does that work?
>>> >>>
>>> >>> —
>>> >>> Chris Mattmann
>>> >>> chris.mattm...@gmail.com
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> Reply-To: <dev@oodt.apache.org>
>>> >>> Date: Monday, November 23, 2015 at 7:43 AM
>>> >>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>>
>>> >>> >Author: Alun Davis - Loudmouth
>>> >>> >Content-Length: 3273160
>>> >>> >Content-Type: audio/mpeg
>>> >>> >X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> >>> >X-TIKA:digest:MD5: 5f374012180e94778346619515152f74
>>> >>> >X-TIKA:digest:SHA256: 34d8bf9da8feb848922138eb7807c0d71ed92376422fb28c8cbbffe788574ab0
>>> >>> >channels: 2
>>> >>> >creator: Alun Davis - Loudmouth
>>> >>> >dc:creator: Alun Davis - Loudmouth
>>> >>> >dc:title: Teenage Baghead
>>> >>> >meta:author: Alun Davis - Loudmouth
>>> >>> >resourceName: Teenage Baghead.mp3
>>> >>> >samplerate: 44100
>>> >>> >title: Teenage Baghead
>>> >>> >version: MPEG 3 Layer III Version 1
>>> >>> >xmpDM:album:
>>> >>> >xmpDM:artist: Alun Davis - Loudmouth
>>> >>> >xmpDM:audioChannelType: Stereo
>>> >>> >xmpDM:audioCompressor: MP3
>>> >>> >xmpDM:audioSampleRate: 44100
>>> >>> >xmpDM:duration: 204577.046875
>>> >>> >xmpDM:genre: Pop
>>> >>> >xmpDM:logComment: www.maimthattune.com for more!
>>> >>> >xmpDM:releaseDate: 2001
>>> >>> >
>>> >>> >
>>> >>> >Nothing that should scare a parser in the mp3 at least.
>>> >>> >
>>> >>> >On Mon, Nov 23, 2015 at 3:33 PM, Chris Mattmann
>>> >>> ><chris.mattm...@gmail.com> wrote:
>>> >>> >
>>> >>> >> Yeah, check the metadata. Any weird UTF-8 encoding?
>>> >>> >>
>>> >>> >> (i.e., run Tika on the file outside of OODT; what do you see?)
>>> >>> >>
>>> >>> >> —
>>> >>> >> Chris Mattmann
>>> >>> >> chris.mattm...@gmail.com
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> >> Reply-To: <dev@oodt.apache.org>
>>> >>> >> Date: Monday, November 23, 2015 at 7:23 AM
>>> >>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> Subject: Re: Crawling / Archiving binary data with Solr backend
>>> >>> >>
>>> >>> >> >./crawler/bin/crawler_launcher --filemgrUrl http://localhost:9000 \
>>> >>> >> >  --operation --launchMetCrawler \
>>> >>> >> >  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
>>> >>> >> >  --productPath $OODT_HOME/data/staging \
>>> >>> >> >  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
>>> >>> >> >  --metExtractorConfig /home/bugg/Projects/surrey100/oodt/data/met/tika.conf
>>> >>> >> >
>>> >>> >> >That's what I'm running. It runs fine with the default Lucene setup and
>>> >>> >> >also with a txt file, but not over a random picture I took or over an
>>> >>> >> >mp3 I tested it on.
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >On Mon, Nov 23, 2015 at 3:12 PM, Mattmann, Chris A (3980) <
>>> >>> >> >chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >>> >> >
>>> >>> >> >> Encoding issues with the extracted metadata? What are you getting
>>> >>> >> >> just running Tika on the files?
>>> >>> >> >>
>>> >>> >> >> The actual data shouldn’t matter since it’s not being ingested
>>> >>> >> >> (are you doing it in place, or what data transferer are you using)?
>>> >>> >> >>
>>> >>> >> >> Cheers,
>>> >>> >> >> Chris
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Chris Mattmann, Ph.D.
>>> >>> >> >> Chief Architect
>>> >>> >> >> Instrument Software and Science Data Systems Section (398)
>>> >>> >> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >>> >> >> Office: 168-519, Mailstop: 168-527
>>> >>> >> >> Email: chris.a.mattm...@nasa.gov
>>> >>> >> >> WWW:  http://sunset.usc.edu/~mattmann/
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >> Adjunct Associate Professor, Computer Science Department
>>> >>> >> >> University of Southern California, Los Angeles, CA 90089 USA
>>> >>> >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> -----Original Message-----
>>> >>> >> >> From: Tom Barber <tom.bar...@meteorite.bi>
>>> >>> >> >> Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Date: Monday, November 23, 2015 at 6:36 AM
>>> >>> >> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>>> >>> >> >> Subject: Crawling / Archiving binary data with Solr backend
>>> >>> >> >>
>>> >>> >> >> >Hello,
>>> >>> >> >> >
>>> >>> >> >> >Looks like I've never tried it before with binary data. If I swap
>>> >>> >> >> >the filemgr defaults to use Solr and then try to crawl my staging
>>> >>> >> >> >directory using the Tika extractor, I get a lot of:
>>> >>> >> >> >
>>> >>> >> >> >org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
>>> >>> >> >> >org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
>>> >>> >> >> >ingesting product [org.apache.oodt.cas.filemgr.structs.Product@62b19476] : null
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
>>> >>> >> >> >at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >Type things.
>>> >>> >> >> >
>>> >>> >> >> >Any ideas?
>>> >>> >> >> >
>>> >>> >> >> >Tom
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>>
>>>
>>>
>>
>
