Gotcha. File type/protocol analysis is always tricky. While they *should* all work the same, you will probably have better results with text-based formats to start, so your approach makes a lot of sense. I look forward to seeing these features soon. Reach out if you have any more questions.
Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

> On Sep 28, 2016, at 10:53 AM, Russell Bateman <[email protected]> wrote:
>
> Thanks, Andy.
>
> I'm delighted to know I wasn't the only one.
>
> As it turns out, I've largely finished my generic Tika processor for NiFi. I
> found that Mark was right on about the session.read() behavior. The tests I
> had driving this work were for plain text, PDF and .png. It doesn't yet do
> the latter, the one that made me think there might be a problem reading
> twice. I'll be adding XML and HTML before concentrating on images again which
> just plain don't work, so I'll have to learn something.
>
> Russ
>
>
> On 09/28/2016 11:29 AM, Andy LoPresto wrote:
>> Russell,
>>
>> While I believe Mark is right on using session.read() multiple times, I
>> encountered this a while ago and I didn’t know about the session.read()
>> behavior, so I used .mark() and .reset(). You can see this in CipherUtility
>> [1] when I was reading salts and IVs from cipher text input streams.
>>
>> I’m not sure if this will work in your scenario where they are two different
>> InputStreamCallbacks, because as Mark said, each call to session.read()
>> should result in a new InputStream.
>>
>> [1]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/crypto/CipherUtility.java#L264-L284
>>
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>>
>>> On Sep 28, 2016, at 9:47 AM, Russell Bateman <[email protected]> wrote:
>>>
>>> Mark,
>>>
>>> Thanks for this clarification, getting a new stream the second time is very
>>> convenient.
>>>
>>> Yes, this answers my question except that this isn't what I observe, but
>>> I'm probably mistaken in my observation. I'm actually giving the stream to
>>> Tika for MIME-type identification if the type isn't already known, then
>>> asking Tika to parse it (based on the known type).
>>>
>>> I'll stop back by if I need to, but your confirmation tells me what I'm
>>> seeing is likely some other effect.
>>>
>>> Thanks.
>>>
>>>
>>> On 09/28/2016 10:42 AM, Mark Payne wrote:
>>>> Russ,
>>>>
>>>> Each time that you call session.read(), you're going to get a new
>>>> InputStream that starts at the beginning of the FlowFile. So you can
>>>> just call session.read() twice. For example:
>>>>
>>>> final AtomicBoolean processContents = new AtomicBoolean(false);
>>>> session.read(flowFile, new InputStreamCallback() {
>>>>     public void process(InputStream in) {
>>>>         // read contents
>>>>         processContents.set( someValue );
>>>>     }
>>>> });
>>>>
>>>> if (processContents.get()) {
>>>>     session.read(flowFile, new InputStreamCallback() {
>>>>         public void process(InputStream in) {
>>>>             // we now have a new InputStream that starts at the
>>>>             // beginning of the FlowFile.
>>>>         }
>>>>     });
>>>> }
>>>>
>>>>
>>>> Does this answer your question sufficiently?
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>>> On Sep 28, 2016, at 12:30 PM, Russell Bateman <[email protected]> wrote:
>>>>>
>>>>> This is more a Java question, I'm guessing. I have experimented
>>>>> unconvincingly using Apache Commons I/O TeeInputStream, but let me back
>>>>> up...
>>>>>
>>>>> I just need to, in some cases, consume the input stream:
>>>>>
>>>>> // under some condition, look into the flowfile contents to see if
>>>>> // something's there...
>>>>> session.read( flowfile, new InputStreamCallback()
>>>>> {
>>>>>     @Override
>>>>>     public void process( InputStream in ) throws IOException
>>>>>     {
>>>>>         // read from in..
>>>>>     }
>>>>> } );
>>>>>
>>>>>
>>>>> then, later (and always) consume it (so, sometimes a second time):
>>>>>
>>>>> session.read( flowfile, new InputStreamCallback()
>>>>> {
>>>>>     @Override
>>>>>     public void process( InputStream in ) throws IOException
>>>>>     {
>>>>>         // read from in..
>>>>>     }
>>>>> } );
>>>>>
>>>>>
>>>>> Obviously, the content's gone at that point if I've already consumed it.
>>>>>
>>>>> What should I do here instead? I don't have control over the close(), do
>>>>> I?
>>>>>
>>>>> Thanks for any comment,
>>>>>
>>>>> Russ
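
For reference, the .mark()/.reset() approach mentioned in the thread can be sketched with plain java.io streams. This is only a minimal illustration, not the CipherUtility code and not NiFi-specific: the class name MarkResetSketch and the helpers peek() and readAll() are hypothetical, and the sketch assumes the stream is wrapped in a BufferedInputStream so that mark/reset is supported.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class name; illustration only.
class MarkResetSketch {

    // Reads up to 'limit' bytes, then rewinds the stream so the next
    // reader starts again from the marked position.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        in.mark(limit);                      // remember the current position
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break;                   // stream ended before 'limit' bytes
                }
                total += n;
            }
        } finally {
            in.reset();                      // rewind to the marked position
        }
        return Arrays.copyOf(buf, total);
    }

    // Drains the stream from its current position to the end.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "%PDF-1.4 body".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams lacking it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));

        // First pass: sniff the leading bytes (e.g. for type detection).
        String header = new String(peek(in, 4), StandardCharsets.US_ASCII);
        // Second pass: the full content is still available after the reset.
        String all = new String(readAll(in), StandardCharsets.US_ASCII);

        System.out.println(header);
        System.out.println(all);
    }
}
```

This only helps when the peek and the full read happen on the same stream, for example inside a single InputStreamCallback. Across two separate session.read() calls it should not be needed, since, as Mark said, each call yields a new InputStream positioned at the beginning of the FlowFile.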
