Hi Karl,

I mainly work with images. Tika does extract EXIF metadata from images; I have attached a ManifoldCF log containing the image metadata that Tika extracted. I would like to use a separate connector after Tika to extract low-level features such as SIFT to support image search. Currently I cannot do that, because for images the stream length is zero.
But I tried with some PDF documents and, as you said, I can see output connection ingestion records with correct document sizes.

Thank you,
Chalitha

On Sat, Jul 18, 2015 at 2:47 AM, Karl Wright <[email protected]> wrote:

> Hi Chalitha,
>
> The only documents I see here are documents that Tika cannot extract
> content from, namely JPGs etc.
>
> Karl
>
> On Fri, Jul 17, 2015 at 12:09 PM, chalitha udara Perera <[email protected]> wrote:
>
>> Hi Karl,
>>
>> Here I have attached the result from File System -> Tika Transform ->
>> Null Output.
>> Please find the attachment.
>>
>> Thank you,
>> Chalitha
>>
>> On Fri, Jul 17, 2015 at 6:41 PM, Karl Wright <[email protected]> wrote:
>>
>>> I don't see this here.
>>>
>>> I set up the following:
>>> - file system repository connection
>>> - null output connection
>>> - tika extractor
>>> - a job using all three
>>>
>>> Running the job and looking at the simple history, I see null output
>>> connection ingestion records that have proper document sizes.
>>>
>>> Can you repeat the same setup there, and tell me what you get?
>>>
>>> Thanks,
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> ------------------------------
>>> From: chalitha udara Perera
>>> Sent: 7/17/2015 8:46 AM
>>> To: Karl Wright
>>> Cc: [email protected]
>>> Subject: Re: Repository document stream empty after Tika Transformation
>>>
>>> Hi Karl,
>>>
>>> I'm using the 2.1 release and I am using only the Solr output connector.
>>> If you look at the input stream size (document.getBinaryLength()) after
>>> the Tika connector, it is zero.
>>>
>>> Thanks,
>>> Chalitha
>>>
>>> On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> The document stream contains what Tika extracts. If it can't extract
>>>> anything then you will have an empty stream.
>>>>
>>>> It is also possible that if the stream is split, you are tripping over
>>>> a bug that was fixed some time ago. What MCF version is this, and do you
>>>> have more than one output?
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: chalitha udara Perera
>>>> Sent: 7/17/2015 7:25 AM
>>>> To: [email protected]
>>>> Subject: Repository document stream empty after Tika Transformation
>>>>
>>>> Hi All,
>>>>
>>>> I'm writing a transformation connector to extract low-level features
>>>> from images. First I used that connector without the Tika extractor and
>>>> it worked fine. But when I used it with the Tika connector (after Tika),
>>>> it fails to extract features. After debugging I found out that the
>>>> stream is empty after the Tika transformation.
>>>> Inside the Tika connector, a new in-memory or file stream output is
>>>> created, but the original input stream is never copied to it. The
>>>> connector should reset the binary stream after using it to get metadata,
>>>> so that the original input stream is available from connector to
>>>> connector.
>>>>
>>>> Here I have attached a simple solution of stream copy and reset that
>>>> worked for me.
>>>>
>>>> Thanks,
>>>> Chalitha
>>>>
>>>> --
>>>> J.M Chalitha Udara Perera
>>>>
>>>> *Department of Computer Science and Engineering,*
>>>> *University of Moratuwa,*
>>>> *Sri Lanka*

--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
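
The "stream copy and reset" idea described in the original message can be sketched in plain Java. This is only an illustration, not the attached patch (which is not reproduced in this thread) and not the Tika connector's actual code: `ReplayableContent`, `drain`, and `newStream` are hypothetical names, and the sketch uses nothing beyond `java.io`. It assumes each document is small enough to hold in memory; large files would more likely need a temporary file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/**
 * Hypothetical helper (not part of ManifoldCF or Tika): holds a copy of a
 * one-shot input stream so that several consumers can each read it in full.
 */
final class ReplayableContent {
    private final byte[] bytes;

    private ReplayableContent(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Reads the source stream to its end and keeps the bytes in memory. */
    static ReplayableContent drain(InputStream source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = source.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new ReplayableContent(buffer.toByteArray());
        } catch (IOException e) {
            // Wrapped only to keep the sketch short; real code may prefer
            // to declare and propagate IOException.
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a fresh stream positioned at the first byte. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    /** Number of bytes available to every consumer. */
    long length() {
        return bytes.length;
    }
}
```

With a helper like this, one stream from `newStream()` could be handed to the metadata extraction step and a second one passed to the next connector in the pipeline, so the downstream connector would still see the original image bytes with a non-zero length.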
