Re: NiFI XProc Processor

Joe Witt Tue, 07 Mar 2017 07:49:04 -0800

Steve,

First, thanks for raising the discussion, and it's awesome that you've
built your own processor to help leverage the XML Calabash software.
A few quick thoughts from a short scan:


- Processor naming:
  Try to come up with a name that is of the form 'verb' 'subject'.
For this processor it seems like PipelineXML or ProcessXML is
appropriate.  The description and tags for the processor are where
you'd want to put things like 'XProc', 'XMLCalabash', 'XML', etc.

- Thread safety:
  XML processing is notoriously slow.  You probably want to go the
extra mile to make it support multiple threads and remove
@TriggerSerially.  You can either create the necessary XMLCalabash
objects on demand with each onTrigger call or, if this is expensive,
you can operate on batches of flowfiles at a time (slightly less cool
in these streamy days), or you could have a small pool/cache of these
objects which are lazily initialized and then reused on subsequent
calls.  All the necessary lifecycle hooks are in place on the
processor for any of these patterns.

- Interest in a contrib to the community:
  XML is indeed quite common and often people want to work with it.
Provided there is a healthy contribution and all licensing and notice
aspects are in order then I think we'd be quite happy to help you turn
it into a contribution to the apache nifi community.  If you decide
not to go that route this is ok as well but obviously we'd like to
help you contribute to the community itself if possible.

- File based versus provided configuration:
  Consider allowing the user to enter/paste a pipeline
configuration directly into the property as an alternative to relying
on a file reference.  Having the configuration entered directly
greatly eases the burden on an administrator, who would otherwise have
to put that config somewhere on all systems in a nifi cluster, and it
also means users can easily tweak their pipelines through the web UI.

- Provide a sample configuration/template using it:
  It would be awesome if you could write a blog or something that
shows this thing in all its glory.  How to set it up, sample data, a
pipeline, and the results.  That would be very helpful.

- Handling of 'original' flowfile:
  Consider having an 'original' relationship to which you transfer the
original flowfile when all goes well, rather than removing it from the
session.  We've found that folks often like to use that relationship
after processing succeeds, or they can simply terminate it.  Either
way it gives them the control.

- Memory management:
  Can you describe the memory management aspects of this processor?
Will it load the original document in memory fully and will it have
all outputs in memory at once?  This is a common challenge with XML
stuff.  This will need to be well described in the processor's
documentation so users can carefully consider how many
instances/threads/etc. to use.
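
For the thread-safety point above, here is a rough sketch of the
'small pool' option in plain Java.  It is only an illustration under
assumptions: LazyPool, withInstance, and PoolDemo are hypothetical
names, and a StringBuilder stands in for the non-thread-safe
XMLCalabash objects, since how expensive those are to build is not
something I have measured.  Each call borrows an idle instance or
lazily creates one, and hands it back only if the work succeeded.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper: a small pool of lazily created, non-thread-safe objects.
final class LazyPool<T> {
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();
    private final AtomicInteger created = new AtomicInteger();
    private final Supplier<T> factory;

    LazyPool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Runs 'work' with an instance that no other thread holds at the same time.
    <R> R withInstance(Function<T, R> work) {
        T instance = idle.poll();        // reuse an idle instance if there is one
        if (instance == null) {          // otherwise create one on demand
            instance = factory.get();
            created.incrementAndGet();
        }
        R result = work.apply(instance); // if this throws, the instance is dropped
        idle.offer(instance);            // success: hand it back for later calls
        return result;
    }

    int createdCount() {
        return created.get();
    }

    // Forget idle instances, e.g. when the pipeline configuration changes.
    void clear() {
        idle.clear();
    }
}

class PoolDemo {
    public static void main(String[] args) {
        // StringBuilder is only a stand-in for an expensive, non-thread-safe object.
        LazyPool<StringBuilder> pool = new LazyPool<>(StringBuilder::new);
        for (String doc : new String[] {"<a/>", "<b/>"}) {
            String out = pool.withInstance(sb -> {
                sb.setLength(0);
                return sb.append("processed ").append(doc).toString();
            });
            System.out.println(out);
        }
        System.out.println("instances created: " + pool.createdCount());
    }
}
```

In a processor, something like this could presumably be built in an
@OnScheduled method and cleared in @OnStopped, with onTrigger doing
its work inside withInstance.  The pool never holds more instances
than there were concurrent callers, so the number of concurrent tasks
configured for the processor bounds it.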

I noticed you did a really nice job of accounting for flowfiles and
ensuring provenance would work here.  Nice job!

Thanks
Joe

On Tue, Mar 7, 2017 at 10:17 AM, Steve Lawrence <[email protected]> wrote:
> We have developed a NiFi processor that uses XMLCalabash [1] to add
> support for XProc [2] processing. XProc is an XML transformation
> language that defines an XML pipeline, allowing for complex validation,
> transformation, and routing of XML data within the pipeline, using
> existing XML technologies such as RelaxNG, Schematron, XSD Schema,
> XQuery, XSLT, XPath and custom XProc transformations.
>
> This new processor is mostly straightforward, but we had some questions
> regarding the specific implementation and the handling of non-thread
> safe code. The code is available for viewing here:
>
>
> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/nifi-xproc/browse
>
> In this processor, a property is created to provide an XProc file, which
> defines the pipeline input and output "ports". XML goes into an input
> port, goes through the pipeline, and one or more XML documents exit at
> specified output ports. This NiFi processor maps each output port to a
> dynamic NiFi relationship. It does this mapping in the
> onPropertyModified method when the XProc file property is changed. This
> method also stores the XMLCalabash XRuntime and XPipeline objects (which
> do all the pipeline work) in volatile member variables to be used later
> in onTrigger. The members are saved here to avoid recreating them in
> each call to onTrigger. Is this an acceptable place to do that? It seems
> this normally happens in an @OnScheduled method or in the first call to
> onTrigger, however the objects must be created in onPropertyModified to
> get the output ports, so this does avoid recreating the same objects
> multiple times. Also note that the same objects are created in the
> XML_PIPELINE_VALIDATOR but are not saved due to the validator being
> static, so there is already some duplication. Is there a standard way to
> avoid duplication/is this an acceptable way to handle this?
>
> The other concern we have is that the XPipeline and XRuntime objects
> created by XML Calabash are not thread safe. To resolve this issue, the
> processor is annotated with @TriggerSerially. Is this the correct
> solution, or is there some other preferred method? Perhaps ThreadLocal
> or a thread safe pool of XPipeline objects is preferred?
>
> Lastly, is this something the devs would be interested in pulling into
> NiFi, and if not, what could be changed to achieve this? The code is
> licensed as Apache v2 and we would be happy to contribute the code to
> NiFi if deemed acceptable.
>
> Thanks,
> - Steve
>
> [1] http://xmlcalabash.com/
> [2] https://www.w3.org/TR/xproc/
