Gotcha.

File type/protocol analysis is always tricky. While they *should* all work the 
same, you will probably have better results with text-based formats to start, 
so your approach makes a lot of sense. I look forward to seeing these features 
soon. Reach out if you have any more questions.


Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Sep 28, 2016, at 10:53 AM, Russell Bateman 
> <[email protected]> wrote:
> 
> Thanks, Andy.
> 
> I'm delighted to know I wasn't the only one.
> 
> As it turns out, I've largely finished my generic Tika processor for NiFi. I 
> found that Mark was right about the session.read() behavior. The tests I 
> had driving this work were for plain text, PDF and .png. It doesn't yet do 
> the last of these, the one that made me think there might be a problem reading 
> twice. I'll be adding XML and HTML before concentrating on images again, which 
> just plain don't work, so I'll have to learn something.
> 
> Russ
> 
> 
> On 09/28/2016 11:29 AM, Andy LoPresto wrote:
>> Russell,
>> 
>> While I believe Mark is right on using session.read() multiple times, I 
>> encountered this a while ago and I didn’t know about the session.read() 
>> behavior, so I used .mark() and .reset(). You can see this in CipherUtility 
>> [1] when I was reading salts and IVs from cipher text input streams.
>> 
>> I’m not sure if this will work in your scenario where they are two different 
>> InputStreamCallbacks, because as Mark said, each call to session.read() 
>> should result in a new InputStream.
>> 
>> [1] 
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/crypto/CipherUtility.java#L264-L284
>> 
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>>> On Sep 28, 2016, at 9:47 AM, Russell Bateman 
>>> <[email protected]> wrote:
>>> 
>>> Mark,
>>> 
>>> Thanks for this clarification, getting a new stream the second time is very 
>>> convenient.
>>> 
>>> Yes, this answers my question except that this isn't what I observe, but 
>>> I'm probably mistaken in my observation. I'm actually giving the stream to 
>>> Tika for MIME-type identification if the type isn't already known, then 
>>> asking Tika to parse it (based on the known type).
>>> 
>>> I'll stop back by if I need to, but your confirmation tells me what I'm 
>>> seeing is likely some other effect.
>>> 
>>> Thanks.
>>> 
>>> 
>>> On 09/28/2016 10:42 AM, Mark Payne wrote:
>>>> Russ,
>>>> 
>>>> Each time that you call session.read(), you're going to get a new 
>>>> InputStream that starts at the beginning
>>>> of the FlowFile. So you can just call session.read() twice. For example:
>>>> 
>>>> final AtomicBoolean processContents = new AtomicBoolean(false);
>>>> session.read(flowFile, new InputStreamCallback() {
>>>>    @Override
>>>>    public void process(InputStream in) throws IOException {
>>>>       // read contents and record whether a second pass is needed
>>>>       processContents.set( someValue );
>>>>    }
>>>> });
>>>> 
>>>> if (processContents.get()) {
>>>>    session.read(flowFile, new InputStreamCallback() {
>>>>        @Override
>>>>        public void process(InputStream in) throws IOException {
>>>>            // we now have a new InputStream that starts at the
>>>>            // beginning of the FlowFile.
>>>>        }
>>>>    });
>>>> }
>>>> 
>>>> 
>>>> Does this answer your question sufficiently?
>>>> 
>>>> Thanks
>>>> -Mark
>>>> 
>>>> 
>>>>> On Sep 28, 2016, at 12:30 PM, Russell Bateman 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> This is more a Java question, I'm guessing. I have experimented 
>>>>> unconvincingly using Apache Commons I/O TeeInputStream, but let me back 
>>>>> up...
>>>>> 
>>>>> I just need to, in some cases, consume the input stream:
>>>>> 
>>>>>  // under some condition, look into the flowfile contents to see if
>>>>>  // something's there...
>>>>>  session.read( flowfile, new InputStreamCallback()
>>>>>  {
>>>>>     @Override
>>>>>     public void process( InputStream in ) throws IOException
>>>>>     {
>>>>>       // read from in..
>>>>>     }
>>>>>  } );
>>>>> 
>>>>> 
>>>>> then, later (and always) consume it (so, sometimes a second time):
>>>>> 
>>>>>  session.read( flowfile, new InputStreamCallback()
>>>>>  {
>>>>>     @Override
>>>>>     public void process( InputStream in ) throws IOException
>>>>>     {
>>>>>       // read from in..
>>>>>     }
>>>>>  } );
>>>>> 
>>>>> 
>>>>> Obviously, the content's gone at that point if I've already consumed it.
>>>>> 
>>>>> What should I do here instead? I don't have control over the close(), do 
>>>>> I?
>>>>> 
>>>>> Thanks for any comment,
>>>>> 
>>>>> Russ
>>> 
>> 
> 
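
For anyone finding this thread later, the mark()/reset() alternative mentioned above can be sketched in plain Java, outside of NiFi. This is only an illustration under assumptions: the class name MarkResetSketch and the helper peek() are made up for this example and are not taken from CipherUtility or the NiFi API. It applies when a single InputStream has to be inspected and then consumed from the start. Across two separate session.read() calls it should not be needed, since each call hands the callback a fresh stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetSketch {

    // Hypothetical helper: read up to `length` bytes from the stream,
    // then rewind so the caller can consume the stream from the same position.
    static byte[] peek(InputStream in, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        byte[] buffer = new byte[length];
        int total = 0;
        in.mark(length); // the mark stays valid for at least `length` bytes read
        try {
            while (total < length) {
                int n = in.read(buffer, total, length - total);
                if (n < 0) {
                    break; // end of stream before `length` bytes were available
                }
                total += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "%PDF-1.4 rest of document".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to streams that lack it.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(content))) {
            byte[] header = peek(in, 4);
            System.out.println("header: " + new String(header, StandardCharsets.US_ASCII));
            // The stream is back at its start, so a full parse can follow here.
            System.out.println("next byte: " + (char) in.read());
        }
    }
}
```

The catch is that the bytes read between mark() and reset() must fit within the read limit passed to mark(), so this suits sniffing a header (a salt, an IV, a magic number) rather than reading the whole content twice.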
