[
https://issues.apache.org/jira/browse/MINIFI-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936788#comment-15936788
]
Andrew Christianson commented on MINIFI-244:
--------------------------------------------
"It would require a set of components which do lens type operations on a
certain set of archive types which would duplicate other similar ones which
expect the data already split out."
With regard to duplication, I'm interpreting that as duplication of processors
such as "extract" or "pack/unpack." Is that correct? All other operations
(transforms of the focused components) would be done with 100%
vanilla/standard processors, with no modification from their current state.
Assuming we are talking duplication with Unpack/Pack, then I think that the
lens vs. extracting processors are semantically different enough to justify
their own existences.
With regard to lens-transformed items interacting with non-lens-aware
processors, the design intent is to be fully backward-compatible and avoid any
dangerous or confusing error scenarios for users, even when they do the wrong
thing. All lens-based processors are expected to report error messages if they
receive invalid input, as would normally occur in a NiFi processor. If a
focused entry in an archive (such as an embedded XML file) is fed into a
non-lens-aware processor that hops out of lens-aware state, such as PostHTTP,
then the data item is implicitly downgraded to a state where the focused item
becomes the item. In other words, a focus operation becomes an extract
operation. I think that would be least surprising to users.
"We could possibly have a ControllerService instead which defines the lens and
understands a given archive type and then have the base processors built to
utilize that lens instead of just going after the raw object. Not sure but I
think there is a concept there to play with."
This gets into the implementation part, which gets interesting. When it comes
to flow files which have been lens-transformed, going after the raw object is,
I think, exactly what we want to happen. That is what makes it 100% compatible
with the entire existing base of processors. The intent is for ordinary
processors to be completely oblivious that they're operating on something that
has a higher dimensionality to it. This is also what makes it conceptually
simple to use: you focus a part, and that part is what gets operated on until
something else is focused. Although I wouldn't necessarily implement it this
way, you could think of a LensFlowFileRecord as being a subtype of FlowFile.
While a ControllerService could indeed get the job done by keeping track of which
flow files have a focus applied to them, and of the associated information and
context required to change focus later on, I think there is a nice simplicity
to allowing the extra information required to be kept with the flow file
itself. The primary advantages are: ability to be serialized along with the
flow file record, ability to be transferred in site-to-site, ability to be
packaged in a future version of the NiFi flow file package format, and better
alignment with parallel processing paradigms (no state access contention or
bottleneck).
The best way I've come up with after some thought is to make one small change
to ProcessSession. In addition to the usual read/write/attribute operations, I
would suggest two new methods:
- stash(key)
- restore(key)
This is analogous to stash and stash pop in git. Therefore, to implement
FocusArchiveEntry, the content is read, the focus item extracted, then the
original content is stashed. The focused item becomes the content. Then, to
re-focus the higher-up (or root-level) part of the archive, the current (focus)
content is moved away, original content restored, then the focused content is
re-incorporated into the archive. Content is stashed according to key, both to
identify the state itself and to avoid collision with other processors that
might use stash/restore. This is kind of similar to how JMS processors have
"jms.*" attributes. All of this, including stashed content, manipulation of the
actual archive, etc. is completely hidden from the flow definition and the user
because it's all implementation detail.
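As a rough illustration of the focus/re-focus sequence described above, here is
a minimal Python sketch. It is not MiNiFi or NiFi code: the FlowFile and
ProcessSession classes are toy stand-ins, the stash key "lens.archive" and the
attribute "lens.archive.focus" are made-up names, and the real API would likely
look different. It only shows the order of operations on a tar archive.

```python
import io
import tarfile
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy flow file (assumption): content, attributes, keyed stashed content."""
    content: bytes
    attributes: dict = field(default_factory=dict)
    stashed: dict = field(default_factory=dict)


class ProcessSession:
    """Hypothetical session exposing only the two proposed operations."""

    def stash(self, flow_file, key):
        # Move the current content aside under `key`; refuse to clobber a key.
        if key in flow_file.stashed:
            raise KeyError(f"stash key already in use: {key}")
        flow_file.stashed[key] = flow_file.content
        flow_file.content = b""

    def restore(self, flow_file, key):
        # Make the stashed content current again and free the key.
        flow_file.content = flow_file.stashed.pop(key)


LENS_KEY = "lens.archive"          # hypothetical stash key
FOCUS_ATTR = "lens.archive.focus"  # hypothetical attribute name


def focus_archive_entry(session, flow_file, entry_name):
    """Read the archive, extract one entry, stash the archive, expose the entry."""
    with tarfile.open(fileobj=io.BytesIO(flow_file.content), mode="r") as tar:
        member = tar.extractfile(entry_name)
        if member is None:
            raise ValueError(f"not a regular file: {entry_name}")
        focused = member.read()
    session.stash(flow_file, LENS_KEY)
    flow_file.content = focused
    flow_file.attributes[FOCUS_ATTR] = entry_name


def unfocus_archive(session, flow_file):
    """Move the focused content away, restore the archive, fold the entry back in."""
    entry_name = flow_file.attributes.pop(FOCUS_ATTR)
    focused = flow_file.content
    session.restore(flow_file, LENS_KEY)
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(flow_file.content), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for info in src.getmembers():
            if info.name == entry_name:
                # Replace the entry body with the (possibly transformed) content.
                info.size = len(focused)
                dst.addfile(info, io.BytesIO(focused))
            elif info.isfile():
                dst.addfile(info, src.extractfile(info))
            else:
                dst.addfile(info)
    flow_file.content = out.getvalue()
```

Between the two calls, any ordinary processor would only ever see
flow_file.content, which is the focused entry. If the stash were simply dropped
at that point, the result would be the plain extract behavior mentioned earlier.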
With that change to ProcessSession, plus adding relevant code to the
serialization/deserialization of flow files, we would then have everything we
need to implement the Focus* processors. We do also open up a Pandora's box of
new processor types that could be created, but I think it is innocent enough
and has some successful precedent in git.
To sum up, I hope I have assuaged the concern about duplicated functionality or
flow concepts, and laid out a reasonable, low-impact implementation. There
are, of course, many ways something like this could be implemented, including
adding some non-invasive extension hooks which could allow us to quarantine the
whole thing as a toggle-able extension. I tend to like the stash/restore as it
seems to fit with the overall NiFi style and methodology, and could have many
other valuable, legitimate uses. I have thought of other implementation
possibilities such as blob attributes or storing state externally, but the
downsides of these are hard to swallow, especially in the context of MiNiFi
where we are going for lightweight and efficient.
[~joewitt] [~aldrin] and others, let me know if you have further concerns on
the concept/language, if you see a way to make it fit in the standard
processors (vs. more of a separate extension), and if you have any concerns or
input on the proposed implementation (adding stash/restore to the
ProcessSession API).
> Create ArchiveLens processor
> ----------------------------
>
> Key: MINIFI-244
> URL: https://issues.apache.org/jira/browse/MINIFI-244
> Project: Apache NiFi MiNiFi
> Issue Type: Task
> Components: C++, Extensions
> Reporter: Andrew Christianson
> Assignee: Andrew Christianson
> Priority: Minor
>
> Create an ArchiveLens processor. A concise, though informal, definition of a
> lens is as follows:
> "Essentially, they represent the act of “peering into” or “focusing in on”
> some particular piece/path of a complex data object such that you can more
> precisely target particular operations without losing the context or
> structure of the overall data you’re working with."
> https://medium.com/@dtipson/functional-lenses-d1aba9e52254#.hdgsvbraq
> Why an ArchiveLens in MiNiFi? Simply put, it will enable us to "focus in on"
> an entry in the archive, perform processing *in-context* of that entry, then
> re-focus on the overall archive. This allows for transformation or other
> processing of an entry in the archive without losing the overall context of
> the archive.
> Initial format support is tar, due to its simplicity and ubiquity.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)