[jira] [Updated] (NIFI-1680) Improve StreamScanner performance

Oleg Zhurakousky (JIRA) Sat, 09 Apr 2016 09:41:18 -0700

     [ 
https://issues.apache.org/jira/browse/NIFI-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Oleg Zhurakousky updated NIFI-1680:
-----------------------------------
    Description: 
We have several use cases where the content of a single FlowFile is split within 
a single cycle (i.e., _onTrigger()_). An example is PutKafka.
Such splitting involves parsing the content InputStream into chunks 
represented as byte[]. Currently we use a custom buffer 
(_org.apache.nifi.stream.io.util.StreamScanner_) to build each byte[]. 
There are several potential areas of improvement here:
1. Perform internal buffering instead of using ByteArrayInputStream and then 
copying bytes into a byte array (bytes that are already in the buffer of 
ByteArrayInputStream).
2. The buffer itself is allocated on the heap. We can consider using a 
DirectBuffer (as an optional flag).
3. Consider a buffer pool, so that a new instance of StreamScanner can take its 
work buffer from the pool instead of allocating a new one. 
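
To make items 2 and 3 concrete, here is a minimal sketch of what a pooled work 
buffer with an optional direct-allocation flag could look like. The class name 
WorkBufferPool and its acquire()/release() methods are hypothetical and are not 
part of NiFi; only the java.nio and java.util.concurrent calls are real APIs.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical pool of fixed-size work buffers; an illustration, not NiFi code. */
final class WorkBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> idle = new ConcurrentLinkedQueue<>();
    private final int bufferSize;
    private final boolean direct;

    /**
     * @param bufferSize capacity of every buffer handed out by this pool
     * @param direct     when true, buffers are allocated off-heap (the "optional flag")
     */
    WorkBufferPool(int bufferSize, boolean direct) {
        this.bufferSize = bufferSize;
        this.direct = direct;
    }

    /** Returns an idle buffer if one exists, otherwise allocates a new one. */
    ByteBuffer acquire() {
        ByteBuffer buffer = idle.poll();
        if (buffer == null) {
            buffer = direct
                    ? ByteBuffer.allocateDirect(bufferSize)
                    : ByteBuffer.allocate(bufferSize);
        }
        return buffer;
    }

    /** Resets position and limit, then makes the buffer available for reuse. */
    void release(ByteBuffer buffer) {
        buffer.clear();
        idle.offer(buffer);
    }
}
```

In this sketch a scanner would call acquire() when it is created and release() 
when it is done with the FlowFile. The pool is unbounded here for brevity; a 
real implementation would probably cap the number of idle buffers.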

Item #1 is by far the most important, as it shows (in my test environment) a 6 
times performance improvement over the current implementation of StreamScanner 
and a 1.5 times performance improvement over _java.io.BufferedReader_, which 
only supports the implicit newline delimiter and was used only for comparison. 
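
For illustration, the internal buffering described in item 1 could be sketched 
roughly as follows: bytes are read from the InputStream straight into one 
reusable work buffer, the delimiter is searched for in place, and each chunk is 
copied out exactly once. The class name DelimitedChunkReader, its constructor 
arguments, and the next() method are assumptions made for this example. This is 
not the actual StreamScanner code, and it makes no claim about the measured 
numbers above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical illustration of internal buffering; not NiFi's StreamScanner. */
final class DelimitedChunkReader {
    private final InputStream in;
    private final byte[] delimiter;
    private byte[] buffer;
    private int start;   // index of the first unconsumed byte
    private int end;     // one past the last byte read from the stream
    private boolean eof;

    DelimitedChunkReader(InputStream in, byte[] delimiter, int initialSize) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        this.in = in;
        this.delimiter = delimiter.clone();
        this.buffer = new byte[Math.max(initialSize, delimiter.length)];
    }

    /** Returns the next chunk without its delimiter, or null once the stream is exhausted. */
    byte[] next() throws IOException {
        int scanned = 0; // offsets after 'start' already ruled out as a delimiter start
        while (true) {
            for (int i = start + scanned; i + delimiter.length <= end; i++) {
                if (matchesAt(i)) {
                    byte[] chunk = Arrays.copyOfRange(buffer, start, i);
                    start = i + delimiter.length;
                    return chunk;
                }
            }
            scanned = Math.max(0, end - start - delimiter.length + 1);
            if (eof) {
                if (start == end) {
                    return null;
                }
                byte[] tail = Arrays.copyOfRange(buffer, start, end);
                start = end;
                return tail;
            }
            fill();
        }
    }

    private boolean matchesAt(int pos) {
        for (int j = 0; j < delimiter.length; j++) {
            if (buffer[pos + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    /** Compacts unconsumed bytes to the front, grows the buffer if full, then reads more. */
    private void fill() throws IOException {
        if (start > 0) {
            System.arraycopy(buffer, start, buffer, 0, end - start);
            end -= start;
            start = 0;
        }
        if (end == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        int n = in.read(buffer, end, buffer.length - end);
        if (n < 0) {
            eof = true;
        } else {
            end += n;
        }
    }
}
```

A caller would loop on next() until it returns null. Unlike 
_java.io.BufferedReader_, the delimiter in this sketch is an arbitrary byte 
sequence supplied by the caller.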


  was:
We have several use cases where the content of a single FlowFile is split within 
a single cycle (i.e., _onTrigger()_). An example is PutKafka.
Such splitting involves parsing the content InputStream into chunks 
represented as byte[]. Currently we use a custom buffer 
(_org.apache.nifi.stream.io.ByteArrayOutputStream_) to build each byte[]. 
There are several potential areas of improvement here:
1. For every cycle we create a new instance of _ByteArrayOutputStream_, 
which essentially reallocates the buffer used to build byte[].
2. The buffer itself is allocated on the heap.
3. Similar to _ByteArrayOutputStream_, the _BufferedInputStream_ is also 
created for each cycle, whereas with some customization it could be reused.



> Improve StreamScanner performance
> ---------------------------------
>
>                 Key: NIFI-1680
>                 URL: https://issues.apache.org/jira/browse/NIFI-1680
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Oleg Zhurakousky
>            Assignee: Oleg Zhurakousky
>             Fix For: 0.7.0
>
>
> We have several use cases where the content of a single FlowFile is split 
> within a single cycle (i.e., _onTrigger()_). An example is PutKafka.
> Such splitting involves parsing the content InputStream into chunks 
> represented as byte[]. Currently we use a custom buffer 
> (_org.apache.nifi.stream.io.util.StreamScanner_) to build each byte[]. 
> There are several potential areas of improvement here:
> 1. Perform internal buffering instead of using ByteArrayInputStream and then 
> copying bytes into a byte array (bytes that are already in the buffer of 
> ByteArrayInputStream).
> 2. The buffer itself is allocated on the heap. We can consider using a 
> DirectBuffer (as an optional flag).
> 3. Consider a buffer pool, so that a new instance of StreamScanner can take 
> its work buffer from the pool instead of allocating a new one. 
> Item #1 is by far the most important, as it shows (in my test environment) a 
> 6 times performance improvement over the current implementation of 
> StreamScanner and a 1.5 times performance improvement over 
> _java.io.BufferedReader_, which only supports the implicit newline delimiter 
> and was used only for comparison. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
