[
https://issues.apache.org/jira/browse/NIFI-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Oleg Zhurakousky updated NIFI-1680:
-----------------------------------
Description:
We have several use cases where the content of a single FlowFile is split
within a single cycle (i.e., _onTrigger()_). An example is PutKafka.
Such splitting involves parsing the content InputStream into chunks
represented as byte[]. Currently we use a custom buffer
(_org.apache.nifi.stream.io.util.StreamScanner_) to build each byte[].
There are several potential areas of improvement here:
1. Perform internal buffering instead of using ByteArrayInputStream and then
copying bytes (which are already in the buffer of ByteArrayInputStream) into
a byte array.
2. The buffer itself is allocated on the heap. We can consider using a
DirectBuffer (as an optional flag).
3. Consider a buffer pool, so that a new instance of StreamScanner can pull
its work buffer from such a pool instead of allocating a new one.
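Items 2 and 3 could be combined roughly as sketched below. This is only an illustration under assumptions: it takes "DirectBuffer" to mean a direct java.nio.ByteBuffer, and the class name WorkBufferPool and its methods are hypothetical, not part of NiFi or of StreamScanner.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch of a work-buffer pool; not an existing NiFi class. */
final class WorkBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> idle = new ConcurrentLinkedQueue<>();
    private final int capacity;
    private final boolean direct; // the "optional flag": off-heap when true

    WorkBufferPool(int capacity, boolean direct) {
        this.capacity = capacity;
        this.direct = direct;
    }

    /** Reuses an idle buffer if one exists, otherwise allocates a new one. */
    ByteBuffer acquire() {
        ByteBuffer buffer = idle.poll();
        if (buffer == null) {
            buffer = direct ? ByteBuffer.allocateDirect(capacity)
                            : ByteBuffer.allocate(capacity);
        }
        buffer.clear(); // reset position and limit; old bytes are not zeroed
        return buffer;
    }

    /** Returns a buffer to the pool; buffers of a different kind are dropped. */
    void release(ByteBuffer buffer) {
        if (buffer.capacity() == capacity && buffer.isDirect() == direct) {
            idle.offer(buffer);
        }
    }
}
```

The pool here is unbounded and never shrinks, which keeps the sketch short; a real implementation would probably cap the number of idle buffers, since direct buffers are not reclaimed as promptly as heap arrays.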
Item #1 is by far the most important: in my test environment it shows a 6x
performance improvement over the current implementation of StreamScanner
and a 1.5x improvement over _java.io.BufferedReader_, which only supports an
implicit newline delimiter and was used only for comparison.
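For illustration, the internal buffering in item #1 might look something like the sketch below. It is not the actual StreamScanner code or the proposed patch; the class name BufferedChunkScanner, the single-byte delimiter and the buffer-doubling policy are all assumptions made to keep the example small.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical sketch; not the StreamScanner implementation from NiFi. */
final class BufferedChunkScanner {
    private final InputStream in;
    private final byte delimiter;
    private byte[] buffer; // internal work buffer, grown on demand
    private int start;     // index of the first unconsumed byte
    private int end;       // index one past the last valid byte
    private boolean eof;

    BufferedChunkScanner(InputStream in, byte delimiter, int initialCapacity) {
        this.in = in;
        this.delimiter = delimiter;
        this.buffer = new byte[Math.max(1, initialCapacity)];
    }

    /**
     * Returns the next chunk without its delimiter, or null when the stream
     * is exhausted. A trailing delimiter does not produce an empty chunk.
     */
    byte[] nextChunk() throws IOException {
        int scanned = 0; // bytes after 'start' already checked for the delimiter
        while (true) {
            for (int i = start + scanned; i < end; i++) {
                if (buffer[i] == delimiter) {
                    // The only copy: straight from the work buffer to the result.
                    byte[] chunk = Arrays.copyOfRange(buffer, start, i);
                    start = i + 1;
                    return chunk;
                }
            }
            scanned = end - start;
            if (eof) {
                if (scanned == 0) {
                    return null;
                }
                byte[] tail = Arrays.copyOfRange(buffer, start, end);
                start = end;
                return tail;
            }
            fill();
        }
    }

    /** Compacts unconsumed bytes to the front, grows if full, then reads more. */
    private void fill() throws IOException {
        if (start > 0) {
            System.arraycopy(buffer, start, buffer, 0, end - start);
            end -= start;
            start = 0;
        }
        if (end == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        int n = in.read(buffer, end, buffer.length - end);
        if (n < 0) {
            eof = true;
        } else {
            end += n;
        }
    }
}
```

A caller would construct it over the FlowFile content stream and call nextChunk() until it returns null. Each chunk is copied once, directly out of the work buffer, which is the point of buffering internally rather than going through an intermediate stream.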
was:
We have several use cases where the content of a single FlowFile is split
within a single cycle (i.e., _onTrigger()_). An example is PutKafka.
Such splitting involves parsing the content InputStream into chunks
represented as byte[]. Currently we use a custom buffer
(_org.apache.nifi.stream.io.ByteArrayOutputStream_) to build each byte[].
There are several potential areas of improvement here:
1. For every cycle we are creating a new instance of _ByteArrayOutputStream_
which essentially reallocates the buffer used to build byte[].
2. The buffer itself is allocated on the heap
3. Similar to _ByteArrayOutputStream_, the _BufferedInputStream_ is also
created for each cycle where with some customization it could be reused.
> Improve StreamScanner performance
> ---------------------------------
>
> Key: NIFI-1680
> URL: https://issues.apache.org/jira/browse/NIFI-1680
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Oleg Zhurakousky
> Assignee: Oleg Zhurakousky
> Fix For: 0.7.0