[jira] [Commented] (MINIFI-244) Create ArchiveLens processor

Andrew Christianson (JIRA) Wed, 22 Mar 2017 10:54:04 -0700

    [ 
https://issues.apache.org/jira/browse/MINIFI-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936788#comment-15936788
 ]


Andrew Christianson commented on MINIFI-244:
--------------------------------------------

"It would require a set of components which do lens type operations on a 
certain set of archive types which would duplicate other similar ones which 
expect the data already split out."

With regard to duplication, I'm interpreting that as duplication of processors 
such as "extract" or "pack/unpack." Is that correct? I ask because all other 
operations (transforms of the focused components) would be done with 100% 
vanilla/standard processors, unmodified from their current state. Assuming we 
are talking about duplication with Unpack/Pack, I think the lens and extracting 
processors are semantically different enough to justify their own existence.

With regard to lens-transformed items interacting with non-lens-aware 
processors, the design intent is to be fully backward-compatible and avoid any 
dangerous or confusing error scenarios for users, even when they do the wrong 
thing. All lens-based processors are expected to report error messages if they 
receive invalid input, as would normally occur in a NiFi processor. If a 
focused entry in an archive (such as an embedded XML file) is fed into a 
non-lens-aware processor that leaves the lens-aware state, such as PostHTTP, 
then the data item is implicitly downgraded to a state where the focused item 
becomes the item. In other words, a focus operation becomes an extract 
operation. I think that would be least surprising to users.
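A minimal sketch of that downgrade, as a toy model only. ToyFlowFile, 
lensContext, upperCase and outgoingBody are hypothetical names invented for 
illustration; none of them are NiFi or MiNiFi APIs. The assumption is that the 
lens context travels next to the content and that ordinary processors only ever 
touch the content.

```cpp
#include <cctype>
#include <string>

// Hypothetical flow file for illustration: ordinary processors only see
// `content`; `lensContext` is assumed to hold whatever is needed to re-focus
// later (here, the original archive bytes) and is empty when nothing is focused.
struct ToyFlowFile {
  std::string content;
  std::string lensContext;
};

// An ordinary, non-lens-aware transform: it upper-cases the content and never
// looks at lensContext.
void upperCase(ToyFlowFile& ff) {
  for (char& c : ff.content) {
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  }
}

// An ordinary sink standing in for something like PostHTTP: the payload that
// leaves the flow is the content alone, so the focused entry becomes "the item".
std::string outgoingBody(const ToyFlowFile& ff) {
  return ff.content;
}
```

In this model, a focused XML entry that reaches the sink is sent exactly as an 
extracted entry would be, and the lens context is simply left behind.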

"We could possibly have a ControllerService instead which defines the lens and 
understands a given archive type and then have the base processors built to 
utilize that lens instead of just going after the raw object. Not sure but I 
think there is a concept there to play with."

This gets into the implementation part, which gets interesting. When it comes 
to flow files which have been lens-transformed, going after the raw object is, 
I think, exactly what we want to happen. That is what makes it 100% compatible 
with the entire existing base of processors. The intent is for ordinary 
processors to be completely oblivious to the fact that they're operating on 
something that has a higher dimensionality to it. This is also what makes it 
conceptually simple to use: you focus a part, and that part is what gets 
operated on until something else is focused. Although I wouldn't necessarily 
implement it this way, you could think of a LensFlowFileRecord being a 
supertype of FlowFile.

While a ControllerService could indeed get the job done by keeping track of which 
flow files have a focus applied to them, and of the associated information and 
context required to change focus later on, I think there is a nice simplicity 
to allowing the extra information required to be kept with the flow file 
itself. The primary advantages are: ability to be serialized along with the 
flow file record, ability to be transferred in site-to-site, ability to be 
packaged in a future version of the NiFi flow file package format, and better 
alignment with parallel processing paradigms (no state access contention or 
bottleneck).

The best way I've come up with after some thought is to make one small change 
to ProcessSession. In addition to the usual read/write/attribute operations, I 
would suggest two new methods:

- stash(key)
- restore(key)

This is analogous to stash and stash pop in git. To implement 
FocusArchiveEntry, the content is read, the focus item extracted, then the 
original content is stashed. The focused item becomes the content. Then, to 
re-focus the higher-up (or root-level) part of the archive, the current (focus) 
content is moved away, the original content is restored, then the focused 
content is re-incorporated into the archive. Content is stashed according to 
key, both to identify the state itself and to avoid collision with other 
processes that might use stash/restore. This is similar to how JMS processors 
have "jms.*" attributes. All of this, including stashed content, manipulation 
of the actual archive, etc., is completely hidden from the flow definition and 
the user because it's all implementation detail.
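The focus/re-focus steps described above could be sketched roughly as follows. 
This is a toy model under stated assumptions, not the proposed implementation: 
ToySession, focusEntry, unfocusEntry and the "lens.archive" key are 
hypothetical names, the "path=data" line encoding stands in for tar only to 
keep the example short (entry data must not contain newlines), and the entry 
path is passed to unfocusEntry explicitly where a real processor would 
presumably keep it with the flow file.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Toy archive: entry path -> entry data, encoded as one "path=data" per line.
using Archive = std::map<std::string, std::string>;

Archive unpack(const std::string& bytes) {
  Archive entries;
  std::istringstream in(bytes);
  std::string line;
  while (std::getline(in, line)) {
    const auto eq = line.find('=');
    if (eq == std::string::npos) throw std::runtime_error("malformed entry");
    entries[line.substr(0, eq)] = line.substr(eq + 1);
  }
  return entries;
}

std::string pack(const Archive& entries) {
  std::string bytes;
  for (const auto& entry : entries) {
    bytes += entry.first + "=" + entry.second + "\n";
  }
  return bytes;
}

// Hypothetical model of one flow file as seen through a session. stash() and
// restore() are the two proposed additions; they are assumptions of this
// sketch, not existing ProcessSession methods.
class ToySession {
 public:
  explicit ToySession(std::string content) : content_(std::move(content)) {}

  const std::string& read() const { return content_; }
  void write(std::string content) { content_ = std::move(content); }

  // Move the current content aside under `key`, leaving the content empty.
  void stash(const std::string& key) {
    assert(!key.empty());
    stashed_[key] = std::move(content_);
    content_.clear();
  }

  // Replace the current content with what was stashed under `key`.
  void restore(const std::string& key) {
    const auto it = stashed_.find(key);
    if (it == stashed_.end()) {
      throw std::runtime_error("nothing stashed under " + key);
    }
    content_ = std::move(it->second);
    stashed_.erase(it);
  }

 private:
  std::string content_;
  std::map<std::string, std::string> stashed_;
};

const char kLensKey[] = "lens.archive";  // assumed key, chosen for this sketch

// Focus: read the archive, extract the entry, stash the original archive,
// and make the focused entry the content.
void focusEntry(ToySession& session, const std::string& path) {
  const Archive entries = unpack(session.read());
  const auto it = entries.find(path);
  if (it == entries.end()) throw std::runtime_error("no such entry: " + path);
  const std::string focused = it->second;
  session.stash(kLensKey);
  session.write(focused);
}

// Re-focus on the archive: set the (possibly transformed) entry aside, restore
// the original archive, and re-incorporate the entry into it.
void unfocusEntry(ToySession& session, const std::string& path) {
  const std::string focused = session.read();
  session.restore(kLensKey);
  Archive entries = unpack(session.read());
  entries[path] = focused;
  session.write(pack(entries));
}
```

Between focusEntry and unfocusEntry, any ordinary processor that only calls 
read() and write() operates on the focused entry without knowing an archive is 
stashed behind it.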

With that change to ProcessSession, plus adding relevant code to the 
serialization/deserialization of flow files, we would then have everything we 
need to implement the Focus* processors. We do also open up a Pandora's box of 
new processor types that could be created, but I think it is innocent enough 
and has a successful precedent in git.

To sum up, I hope I have assuaged the concern about duplicated functionality or 
flow concepts, plus laid out a reasonable, low-impact implementation. There 
are, of course, many ways something like this could be implemented, including 
adding some non-invasive extension hooks which could allow us to quarantine the 
whole thing as a toggle-able extension. I tend to like stash/restore as it 
seems to fit with the overall NiFi style and methodology, and could have many 
other valuable, legitimate uses. I have thought of other implementation 
possibilities such as blob attributes or storing state externally, but the 
downsides of these are hard to swallow, especially in the context of MiNiFi 
where we are going for lightweight and efficient.

[~joewitt] [~aldrin] and others, let me know if you have further concerns on 
the concept/language, if you see a way to make it fit in the standard 
processors (vs. more of a separate extension), and if you have any concerns or 
input on the proposed implementation (adding stash/restore to the 
ProcessSession API).

> Create ArchiveLens processor
> ----------------------------
>
>                 Key: MINIFI-244
>                 URL: https://issues.apache.org/jira/browse/MINIFI-244
>             Project: Apache NiFi MiNiFi
>          Issue Type: Task
>          Components: C++, Extensions
>            Reporter: Andrew Christianson
>            Assignee: Andrew Christianson
>            Priority: Minor
>
> Create an ArchiveLens processor. A concise, though informal, definition of a 
> lens is as follows:
> "Essentially, they represent the act of “peering into” or “focusing in on” 
> some particular piece/path of a complex data object such that you can more 
> precisely target particular operations without losing the context or 
> structure of the overall data you’re working with." 
> https://medium.com/@dtipson/functional-lenses-d1aba9e52254#.hdgsvbraq
> Why an ArchiveLens in MiNiFi? Simply put, it will enable us to "focus in on" 
> an entry in the archive, perform processing *in-context* of that entry, then 
> re-focus on the overall archive. This allows for transformation or other 
> processing of an entry in the archive without losing the overall context of 
> the archive.
> Initial format support is tar, due to its simplicity and ubiquity.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
