Hi Karl,

Maybe it is best not to send image streams, since the binary content would then be indexed. I can use multimedia extraction connectors before the Tika connector in the connector pipeline. The only thing is that I will have to use Tika internally to detect media types. No problem.

Thanks,
Chalitha

On Sat, Jul 18, 2015 at 12:54 PM, Karl Wright <[email protected]> wrote:

> MCF will transmit metadata for images, but since there is no other content,
> the main content stream will have zero length. This seems perfectly
> correct to me; I cannot see that any changes are needed or even desirable
> here.
>
> Thanks
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: chalitha udara Perera
> Sent: 7/18/2015 1:20 AM
> To: Karl Wright
> Cc: [email protected]
> Subject: Re: Repository document stream empty after Tika Transformation
>
> Hi Karl,
>
> I mainly work with images. Actually, Tika extracts EXIF metadata from
> images. I have attached a ManifoldCF log containing the image metadata
> extracted by Tika. I would like to use a separate connector after that to
> extract low-level features such as SIFT to provide image search. Currently
> I cannot do that because, for images, the stream length is zero.
>
> But I tried with some PDF documents and, as you said, I can see output
> connection ingestion records with correct document sizes.
>
> Thank you,
> Chalitha
>
> On Sat, Jul 18, 2015 at 2:47 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Chalitha,
>>
>> The only documents I see here are documents that Tika cannot extract
>> content from, namely JPGs etc.
>>
>> Karl
>>
>> On Fri, Jul 17, 2015 at 12:09 PM, chalitha udara Perera <
>> [email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> Here I have attached the result from File System -> Tika Transform ->
>>> Null Output.
>>> Please find the attachment.
>>>
>>> Thank you,
>>> Chalitha
>>>
>>> On Fri, Jul 17, 2015 at 6:41 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> I don't see this here.
>>>>
>>>> I set up the following:
>>>> - file system repository connection
>>>> - null output connection
>>>> - tika extractor
>>>> - a job using all three
>>>>
>>>> Running the job and looking at the simple history, I see null output
>>>> connection ingestion records that have proper document sizes.
>>>>
>>>> Can you repeat the same setup there, and tell me what you get?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: chalitha udara Perera
>>>> Sent: 7/17/2015 8:46 AM
>>>> To: Karl Wright
>>>> Cc: [email protected]
>>>> Subject: Re: Repository document stream empty after Tika Transformation
>>>>
>>>> Hi Karl,
>>>>
>>>> I'm using the 2.1 release and I am using only the Solr output connector.
>>>> If you look at the input stream size (document.getBinaryLength()) after
>>>> the Tika connector, it is zero.
>>>>
>>>> Thanks,
>>>> Chalitha
>>>>
>>>> On Fri, Jul 17, 2015 at 6:08 PM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> The document stream contains what Tika extracts. If it can't extract
>>>>> anything then you will have an empty stream.
>>>>>
>>>>> It is also possible that if the stream is split, you are tripping over
>>>>> a bug that was fixed some time ago. What MCF version is this, and do you
>>>>> have more than one output?
>>>>>
>>>>> Karl
>>>>>
>>>>> Sent from my Windows Phone
>>>>> ------------------------------
>>>>> From: chalitha udara Perera
>>>>> Sent: 7/17/2015 7:25 AM
>>>>> To: [email protected]
>>>>> Subject: Repository document stream empty after Tika Transformation
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm writing a transformation connector to extract low-level features
>>>>> from images. First I used that connector without the Tika extractor and
>>>>> it worked fine. But when I used it with the Tika connector (after Tika)
>>>>> it fails to extract features. After debugging I found out that the
>>>>> stream is empty after the Tika transformation.
>>>>> Actually, inside the Tika connector, it creates a new in-memory or file
>>>>> stream output, but the original input stream is never copied to it. The
>>>>> connector should reset the binary stream after using it to get
>>>>> metadata, so that the original input stream is available from connector
>>>>> to connector.
>>>>>
>>>>> Here I have attached a simple solution of stream copy and reset that
>>>>> worked for me.
>>>>>
>>>>> Thanks,
>>>>> Chalitha
>>>>>
>>>>> --
>>>>> J.M Chalitha Udara Perera
>>>>>
>>>>> *Department of Computer Science and Engineering,*
>>>>> *University of Moratuwa,*
>>>>> *Sri Lanka*

--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
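
For reference, the "stream copy and reset" idea from the original message can be sketched in plain Java. This is only a minimal sketch under assumptions, not the attached patch and not how the Tika connector is implemented: `ReplayableStream` is a hypothetical helper (it is not a ManifoldCF or Tika class), and it buffers the whole document in memory, so it would suit small files only; a file-backed buffer would be needed for large ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical helper (an assumption for illustration, not part of ManifoldCF):
 * copies a stream once so that several consumers can each read it from the start.
 */
final class ReplayableStream {
    private final byte[] data;

    private ReplayableStream(byte[] data) {
        this.data = data;
    }

    /** Drains the source fully into memory; the caller still closes the source. */
    static ReplayableStream buffer(InputStream source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = source.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new ReplayableStream(out.toByteArray());
    }

    /** Returns a fresh stream positioned at byte 0 on every call. */
    InputStream open() {
        return new ByteArrayInputStream(data);
    }

    /** Number of buffered bytes. */
    long length() {
        return data.length;
    }
}
```

With something like this, one consumer could read `open()` to pull metadata and a later one could call `open()` again to get the untouched original bytes, instead of finding the stream already consumed.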
