[
https://issues.apache.org/jira/browse/NIFI-6496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896255#comment-16896255
]
Edward Armes edited comment on NIFI-6496 at 7/30/19 4:20 PM:
-------------------------------------------------------------
[~otto]: The root of this is a user-list mail which I've linked here:
[http://mail-archives.apache.org/mod_mbox/nifi-users/201907.mbox/%3cCAAPh5FmUoKnadoq+8r2nb=16CjZ3wt=5kozcjukpx8woei2...@mail.gmail.com%3e]
Essentially the TL;DR of this was that due to (what I assume to be a) resource
restriction, [~malthe] wanted to have the FlowFile content kept compressed, as
their content was easily compressible. I don't think the argument about the lack
of resources is relevant here. What is proposed is fine for a custom processor,
but I don't think it is suitable for the standard processor library (for the
reasons I outlined above).
----
Now, as I understand it, there are in general 2 types of processors in NiFi:
ones that consume FlowFiles (Consumers) and ones that don't (Producers). The
internals of NiFi are easy to understand/explain in words, but hard to navigate
in code. As I said, FlowFile content is *never* kept in memory unless it is
being used by a processor; it is instead kept in the content repo (which may
be an in-memory repo). An actual FlowFile (the object that is passed from
processor to processor) is kept in the FlowFile repo and in memory until it is
deemed to be inactive (for whatever reason), after which it exists only in the
FlowFile repo (again, this could be an in-memory-only repo as well). Now, in
correction to what I posted originally, the content of a FlowFile is always
exposed to a processor via an InputStream, so it is always read as a stream,
and thus the entire content of a FlowFile is not in memory unless the
processor author wishes it.
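To illustrate the streaming point (this is plain JDK code, not NiFi's actual processor API — the class and method names here are hypothetical), a consumer can process arbitrarily large content through an InputStream while only a small fixed buffer is ever resident in memory:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamingRead {
    // Count bytes by streaming in fixed-size chunks; at no point is the
    // whole payload buffered, mirroring how a processor can treat
    // FlowFile content as a stream rather than a byte[].
    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192]; // only 8 KiB resident at a time
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1_000_000]; // stand-in for FlowFile content
        System.out.println(countBytes(new ByteArrayInputStream(payload)));
    }
}
```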
However, what I think the original question to the mailing list is trying to
skate around is that the content and provenance repos can grow quickly when the
contents of a FlowFile are modified.
Now, like [~malthe] has said, one option would be some sort of shunt that would
allow for de-compression on demand once the content has been run through the
CompressContent processor. Quite how this would work I don't know; I suspect it
would involve quite a bit of playing with the internals, however. The other
approach I could see is to modify either the default content repo
implementation, the loading of content from the repo, or both, to enable
compression and de-compression of the content when it's loaded under certain
circumstances. There would of course be trade-offs, and there is also the
"per-flow settings" question as well, so there needs to be a discussion about
this more generally.
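The second approach (compress on store, decompress on load) can be sketched with just java.util.zip — to be clear, this does not use NiFi's content-repository interfaces, and the class/method names are hypothetical; the point is only that processors would still see a plain InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressingRepoSketch {
    // "Store" path: gzip the content before it is written to the repo.
    static byte[] store(byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(content);
        }
        return out.toByteArray();
    }

    // "Load" path: hand back a stream that decompresses on demand,
    // so a processor still just consumes an InputStream.
    static InputStream load(byte[] stored) throws IOException {
        return new GZIPInputStream(new ByteArrayInputStream(stored));
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "some,easily,compressible,records\n".repeat(1000).getBytes();
        byte[] stored = store(original);
        System.out.println("original=" + original.length + " stored=" + stored.length);
        byte[] roundTrip = load(stored).readAllBytes();
        System.out.println("roundTrip=" + java.util.Arrays.equals(original, roundTrip));
    }
}
```

The real trade-off the comment alludes to shows up here: every load pays CPU for decompression in exchange for a smaller repo on disk.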
> Add compression support to record reader processor
> --------------------------------------------------
>
> Key: NIFI-6496
> URL: https://issues.apache.org/jira/browse/NIFI-6496
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Malthe Borch
> Priority: Minor
> Labels: easyfix, usability
>
> Text-based record formats such as CSV, JSON and XML compress well and will
> often be transmitted in a compressed format. If compression support is added
> to the relevant processors, users will not need to explicitly unpack files
> before processing (which may not be feasible or practical due to space
> requirements).
> There are at least two ways of implementing this, using either a generic
> approach where a {{CompressedRecordReaderFactory}} is the basis for a new
> controller service that wraps the underlying record reader controller service
> (e.g. {{CSVReader}}); or adding the functionality at the relevant record
> reader implementations.
> The latter option may provide a better UX because no additional
> {{ControllerService}} has to be configured.
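The wrapping approach described in the issue could be sketched with the JDK alone — the following is not the proposed {{CompressedRecordReaderFactory}} itself, just a hypothetical illustration of decorating a reader's input stream so that gzipped and plain content are handled transparently:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MaybeCompressedReader {
    // Peek at the first two bytes; if they are the gzip magic number
    // (0x1f 0x8b), wrap the stream in a GZIPInputStream, otherwise
    // push the bytes back and return the stream unchanged. A wrapping
    // controller service could decorate the underlying record reader's
    // stream like this before parsing records.
    static InputStream maybeDecompress(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] magic = new byte[2];
        int n = pb.read(magic);
        if (n > 0) {
            pb.unread(magic, 0, n);
        }
        boolean gzipped = n == 2
                && (magic[0] & 0xff) == 0x1f
                && (magic[1] & 0xff) == 0x8b;
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    // Helper used only for the demo below.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "id,name\n1,alice\n".getBytes();
        // Plain and gzipped input both yield the same record bytes.
        System.out.print(new String(maybeDecompress(new ByteArrayInputStream(csv)).readAllBytes()));
        System.out.print(new String(maybeDecompress(new ByteArrayInputStream(gzip(csv))).readAllBytes()));
    }
}
```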
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)