[
https://issues.apache.org/jira/browse/NIFI-6496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896255#comment-16896255
]
Edward Armes edited comment on NIFI-6496 at 7/30/19 4:20 PM:
-------------------------------------------------------------
[~otto]: The root of this is a user-list mail which I've linked here:
[http://mail-archives.apache.org/mod_mbox/nifi-users/201907.mbox/%3cCAAPh5FmUoKnadoq+8r2nb=16CjZ3wt=5kozcjukpx8woei2...@mail.gmail.com%3e]
Essentially the TL;DR of this was that due to (what I assume to be a) resource
restriction, [~malthe] wanted to have the FlowFile content kept compressed, as
their content was easily compressible. I don't think the argument about the lack
of resources is relevant here. What is proposed is fine for a custom processor,
but I don't think it is suitable for the standard processor library (for the
reasons I outlined above).
----
Now, as I understand it, there are in general 2 types of processors in NiFi:
ones that consume FlowFiles (Consumers) and ones that don't (Producers). The
internals of NiFi are easy to understand/explain in words, but hard to navigate
in code. As I said, FlowFile content is *never* kept in memory unless it is
being used by a processor; it is instead kept in the content repo (which may
be an in-memory repo). An actual FlowFile (the object that is passed from
processor to processor) is kept in the FlowFile repo and in memory until it is
deemed to be inactive (for whatever reason), after which it exists only in the
FlowFile repo (again, this could be an in-memory-only repo as well). Now, in
correction to what I posted originally, the content of a FlowFile is always
exposed to a processor via an InputStream, so it is always read as a stream,
and thus the entire content of a FlowFile is not in memory unless the
processor author wishes it.
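To illustrate the streaming point (this is plain JDK code, not NiFi's actual processor API — the class and method names here are hypothetical), a consumer can process arbitrarily large content through an InputStream while only a small fixed buffer is ever resident in memory:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamingRead {
    // Count bytes by streaming in fixed-size chunks; at no point is the
    // whole payload buffered, mirroring how a processor can treat
    // FlowFile content as a stream rather than a byte[].
    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192]; // only 8 KiB resident at a time
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[1_000_000]; // stand-in for FlowFile content
        System.out.println(countBytes(new ByteArrayInputStream(payload)));
    }
}
```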
However, what I think the original question to the mailing list is trying to
skate around is that the content and provenance repos can grow quickly when the
contents of a FlowFile are modified.
Now, like [~malthe] has said, one option would be some sort of shunt that would
allow for de-compression on demand once the content has been run through the
CompressContent processor. Quite how this would work I don't know; I suspect it
would involve quite a bit of playing with the internals, however. The other
approach I could see is to modify either the default content repo
implementation, the loading of content from the repo, or both, to enable
compression and de-compression of the content when it's loaded under certain
circumstances. There would of course be trade-offs, and there is also the
"per-flow settings" question as well, so there needs to be a discussion about
this more generally.
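The second approach (compress on store, decompress on load) can be sketched with just java.util.zip — to be clear, this does not use NiFi's content-repository interfaces, and the class/method names are hypothetical; the point is only that processors would still see a plain InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressingRepoSketch {
    // "Store" path: gzip the content before it is written to the repo.
    static byte[] store(byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(content);
        }
        return out.toByteArray();
    }

    // "Load" path: hand back a stream that decompresses on demand,
    // so a processor still just consumes an InputStream.
    static InputStream load(byte[] stored) throws IOException {
        return new GZIPInputStream(new ByteArrayInputStream(stored));
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "some,easily,compressible,records\n".repeat(1000).getBytes();
        byte[] stored = store(original);
        System.out.println("original=" + original.length + " stored=" + stored.length);
        byte[] roundTrip = load(stored).readAllBytes();
        System.out.println("roundTrip=" + java.util.Arrays.equals(original, roundTrip));
    }
}
```

The real trade-off the comment alludes to shows up here: every load pays CPU for decompression in exchange for a smaller repo on disk.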
> Add compression support to record reader processor
> --------------------------------------------------
>
> Key: NIFI-6496
> URL: https://issues.apache.org/jira/browse/NIFI-6496
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Malthe Borch
> Priority: Minor
> Labels: easyfix, usability
>
> Text-based record formats such as CSV, JSON and XML compress well and will
> often be transmitted in a compressed format. If compression support is added
> to the relevant processors, users will not need to explicitly unpack files
> before processing (which may not be feasible or practical due to space
> requirements).
> There are at least two ways of implementing this, using either a generic
> approach where a {{CompressedRecordReaderFactory}} is the basis for a new
> controller service that wraps the underlying record reader controller service
> (e.g. {{CSVReader}}); or adding the functionality at the relevant record
> reader implementations.
> The latter option may provide a better UX because no additional
> {{ControllerService}} has to be configured.
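The wrapping approach described in the issue could be sketched with the JDK alone — the following is not the proposed {{CompressedRecordReaderFactory}} itself, just a hypothetical illustration of decorating a reader's input stream so that gzipped and plain content are handled transparently:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MaybeCompressedReader {
    // Peek at the first two bytes; if they are the gzip magic number
    // (0x1f 0x8b), wrap the stream in a GZIPInputStream, otherwise
    // push the bytes back and return the stream unchanged. A wrapping
    // controller service could decorate the underlying record reader's
    // stream like this before parsing records.
    static InputStream maybeDecompress(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] magic = new byte[2];
        int n = pb.read(magic);
        if (n > 0) {
            pb.unread(magic, 0, n);
        }
        boolean gzipped = n == 2
                && (magic[0] & 0xff) == 0x1f
                && (magic[1] & 0xff) == 0x8b;
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    // Helper used only for the demo below.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "id,name\n1,alice\n".getBytes();
        // Plain and gzipped input both yield the same record bytes.
        System.out.print(new String(maybeDecompress(new ByteArrayInputStream(csv)).readAllBytes()));
        System.out.print(new String(maybeDecompress(new ByteArrayInputStream(gzip(csv))).readAllBytes()));
    }
}
```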
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)