Thanks, Andy.

I'm delighted to know I wasn't the only one.

As it turns out, I've largely finished my generic Tika processor for NiFi. I found that Mark was right about the session.read() behavior. The tests I had driving this work were for plain text, PDF and .png. It doesn't yet handle the last of these, which is what made me think there might be a problem reading twice. I'll be adding XML and HTML before concentrating on images again, which just plain don't work, so I'll have to learn something.

Russ


On 09/28/2016 11:29 AM, Andy LoPresto wrote:
Russell,

While I believe Mark is right on using session.read() multiple times, I encountered this a while ago and I didn’t know about the session.read() behavior, so I used .mark() and .reset(). You can see this in CipherUtility [1] when I was reading salts and IVs from cipher text input streams.
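
For illustration, here is a minimal sketch of the .mark()/.reset() approach on a single stream. It is not the CipherUtility code; the class name MarkResetSketch, the helpers peek() and readAll(), and the sample bytes are all hypothetical, and it assumes the stream is wrapped so that it supports mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetSketch {

    // Hypothetical helper: reads up to `limit` bytes, then rewinds the stream
    // so that a later consumer still sees the content from the beginning.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                      // remember the current position
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break;                       // end of stream before `limit` bytes
            }
            total += n;
        }
        in.reset();                          // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }

    // Hypothetical helper: drains whatever remains in the stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) >= 0) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes standing in for FlowFile content (an assumption).
        byte[] content = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));

        byte[] header = peek(in, 4);         // first pass: look at the header only
        byte[] all = readAll(in);            // second pass: full content, from the start

        System.out.println(new String(header, StandardCharsets.US_ASCII));
        System.out.println(new String(all, StandardCharsets.US_ASCII));
    }
}
```

This only helps when both reads happen on the same InputStream, and the mark stays valid only while no more than the readlimit passed to mark() has been read.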

I’m not sure if this will work in your scenario where they are two different InputStreamCallbacks, because as Mark said, each call to session.read() should result in a new InputStream.

[1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/crypto/CipherUtility.java#L264-L284

Andy LoPresto
[email protected] <mailto:[email protected]>
/[email protected] <mailto:[email protected]>/
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Sep 28, 2016, at 9:47 AM, Russell Bateman <[email protected] <mailto:[email protected]>> wrote:

Mark,

Thanks for this clarification; getting a new stream the second time is very convenient.

Yes, this answers my question except that this isn't what I observe, but I'm probably mistaken in my observation. I'm actually giving the stream to Tika for MIME-type identification if the type isn't already known, then asking Tika to parse it (based on the known type).

I'll stop back by if I need to, but your confirmation tells me what I'm seeing is likely some other effect.

Thanks.


On 09/28/2016 10:42 AM, Mark Payne wrote:
Russ,

Each time that you call session.read(), you're going to get a new InputStream that starts at the beginning
of the FlowFile. So you can just call session.read() twice. For example:

final AtomicBoolean processContents = new AtomicBoolean(false);
session.read(flowFile, new InputStreamCallback() {
    @Override
    public void process(InputStream in) throws IOException {
        // read contents, then record whether a second pass is needed
        // (someValue is a placeholder for that decision)
        processContents.set( someValue );
    }
});

if (processContents.get()) {
    session.read(flowFile, new InputStreamCallback() {
        @Override
        public void process(InputStream in) throws IOException {
            // we now have a new InputStream that starts at the beginning of the FlowFile.
        }
    });
}


Does this answer your question sufficiently?

Thanks
-Mark


On Sep 28, 2016, at 12:30 PM, Russell Bateman <[email protected] <mailto:[email protected]>> wrote:

This is more a Java question, I'm guessing. I have experimented unconvincingly using Apache Commons I/O TeeInputStream, but let me back up...

I just need to, in some cases, consume the input stream:

  // under some condition, look into the flowfile contents to see if
  // something's there...
  session.read( flowfile, new InputStreamCallback()
  {
     @Override
     public void process( InputStream in ) throws IOException
     {
       // read from in..
     }
  } );


then, later (and always) consume it (so, sometimes a second time):

  session.read( flowfile, new InputStreamCallback()
  {
     @Override
     public void process( InputStream in ) throws IOException
     {
       // read from in..
     }
  } );


Obviously, the content's gone at that point if I've already consumed it.

What should I do here instead? I don't have control over the close(), do I?

Thanks for any comment,

Russ


