Thanks, Andy.
I'm delighted to know I wasn't the only one.
As it turns out, I've largely finished my generic Tika processor for
NiFi. I found that Mark was right about the session.read() behavior.
The tests I had driving this work were for plain text, PDF and .png.
The processor doesn't yet handle the last of these, which is the one
that made me think there might be a problem reading twice. I'll be
adding XML and HTML before concentrating on images again, which just
plain don't work, so I'll have to learn something.
Russ
On 09/28/2016 11:29 AM, Andy LoPresto wrote:
Russell,
While I believe Mark is right on using session.read() multiple times,
I encountered this a while ago and I didn’t know about the
session.read() behavior, so I used .mark() and .reset(). You can see
this in CipherUtility [1] when I was reading salts and IVs from cipher
text input streams.
I’m not sure if this will work in your scenario where they are two
different InputStreamCallbacks, because as Mark said, each call to
session.read() should result in a new InputStream.
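For anyone who has not used the .mark()/.reset() approach mentioned above, here is a minimal sketch using only java.io. It is not the CipherUtility code: the class name MarkResetSketch, the peekHeader and readAll helpers, and the 4-byte header length are assumptions made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetSketch {

    /** Reads up to 'length' bytes from the stream, then rewinds it to where it was. */
    static byte[] peekHeader(InputStream in, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported; wrap the stream in a BufferedInputStream");
        }
        in.mark(length);                       // remember the current position
        byte[] header = new byte[length];
        int total = 0;
        while (total < length) {
            int n = in.read(header, total, length - total);
            if (n < 0) {
                break;                         // stream ended before 'length' bytes
            }
            total += n;
        }
        in.reset();                            // rewind to the marked position
        return Arrays.copyOf(header, total);
    }

    /** Drains the rest of the stream into a byte array. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical content: a 4-byte "header" followed by a payload.
        byte[] content = "SALTpayload".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(content))) {
            String header = new String(peekHeader(in, 4), StandardCharsets.US_ASCII);
            String whole = new String(readAll(in), StandardCharsets.US_ASCII);
            System.out.println(header); // prints "SALT"
            System.out.println(whole);  // prints "SALTpayload"
        }
    }
}
```

Note that mark() and reset() only rewind within one stream, and only as far back as the read limit passed to mark(). They do not help across two separate callbacks that each receive their own stream.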
[1]
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/crypto/CipherUtility.java#L264-L284
Andy LoPresto
[email protected] <mailto:[email protected]>
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
On Sep 28, 2016, at 9:47 AM, Russell Bateman wrote:
Mark,
Thanks for this clarification. Getting a new stream the second time
is very convenient.
Yes, this answers my question except that this isn't what I observe,
but I'm probably mistaken in my observation. I'm actually giving the
stream to Tika for MIME-type identification if the type isn't already
known, then asking Tika to parse it (based on the known type).
I'll stop back by if I need to, but your confirmation tells me what
I'm seeing is likely some other effect.
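The detect-then-parse flow described here can be sketched without NiFi or Tika on the classpath. Everything in this block is a stand-in: ContentSource is a hypothetical interface that plays the role of session.read() by handing out a fresh stream each time, detectType is a toy magic-byte check (not Tika's detection logic), and "parsing" is reduced to counting bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TwoPassSketch {

    /** Hypothetical stand-in for session.read(): each open() returns a new stream at offset 0. */
    interface ContentSource {
        InputStream open() throws IOException;
    }

    /** Toy type sniffing on the first four bytes; only recognizes PDF and PNG. */
    static String detectType(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        int total = 0;
        while (total < magic.length) {
            int n = in.read(magic, total, magic.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        if (total == 4 && magic[0] == '%' && magic[1] == 'P' && magic[2] == 'D' && magic[3] == 'F') {
            return "application/pdf";
        }
        if (total == 4 && (magic[0] & 0xFF) == 0x89 && magic[1] == 'P' && magic[2] == 'N' && magic[3] == 'G') {
            return "image/png";
        }
        return "text/plain"; // fallback assumed for this sketch
    }

    /** Placeholder for parsing: consumes the whole stream and returns its length. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** First pass only when the type is unknown; the second pass always starts from byte 0. */
    static String process(ContentSource source, String knownType) throws IOException {
        String type = knownType;
        if (type == null) {
            try (InputStream in = source.open()) {
                type = detectType(in);
            }
        }
        long length;
        try (InputStream in = source.open()) {
            length = countBytes(in);
        }
        return type + ":" + length;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        ContentSource source = () -> new ByteArrayInputStream(content);
        System.out.println(process(source, null));
    }
}
```

Because the second pass opens its own stream, it sees the full content regardless of how much the detection pass consumed.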
Thanks.
On 09/28/2016 10:42 AM, Mark Payne wrote:
Russ,
Each time that you call session.read(), you're going to get a new
InputStream that starts at the beginning
of the FlowFile. So you can just call session.read() twice. For example:
final AtomicBoolean processContents = new AtomicBoolean(false);
session.read(flowFile, new InputStreamCallback() {
    @Override
    public void process(final InputStream in) throws IOException {
        // read contents; someValue is a placeholder for whatever you learn from them
        processContents.set( someValue );
    }
});
if (processContents.get()) {
    session.read(flowFile, new InputStreamCallback() {
        @Override
        public void process(final InputStream in) throws IOException {
            // we now have a new InputStream that starts at the beginning of the FlowFile.
        }
    });
}
Does this answer your question sufficiently?
Thanks
-Mark
On Sep 28, 2016, at 12:30 PM, Russell Bateman wrote:
This is more a Java question, I'm guessing. I have experimented
unconvincingly using Apache Commons I/O TeeInputStream, but let me
back up...
I just need to, in some cases, consume the input stream:
// under some condition, look into the flowfile contents to see if something's there...
session.read( flowfile, new InputStreamCallback()
{
    @Override
    public void process( InputStream in ) throws IOException
    {
        // read from in...
    }
} );
then, later (and always) consume it (so, sometimes a second time):
session.read( flowfile, new InputStreamCallback()
{
    @Override
    public void process( InputStream in ) throws IOException
    {
        // read from in...
    }
} );
Obviously, the content's gone at that point if I've already
consumed it.
What should I do here instead? I don't have control over the
close(), do I?
Thanks for any comment,
Russ