Steve,

First, thanks for raising the discussion, and it's awesome that you've built your own processor to help leverage the XML Calabash software. A couple of quick thoughts from a short scan:

- Processor naming:
Try to come up with a name of the form 'verb' 'subject'. For this processor it seems like PipelineXML or ProcessXML is appropriate. The description and tags for the processor are where you'd want to put things like 'XProc', 'XMLCalabash', 'XML', etc.

- Thread safety:
XML processing is notoriously slow. You probably want to go the extra mile to make it support multiple threads and remove @TriggerSerially. You can either create the necessary XMLCalabash objects on demand with each onTrigger call or, if this is expensive, you can operate on batches of flowfiles at a time (slightly less cool these streamy days), or you could have a small pool/cache of these objects which are lazily initialized and then reused on subsequent calls. All the necessary lifecycle hooks are in place on the processor for any of these patterns.

- Interest in a contrib to the community:
XML is indeed quite common and often people want to work with it. Provided there is a healthy contribution and all licensing and notice aspects are in order, I think we'd be quite happy to help you turn it into a contribution to the Apache NiFi community. If you decide not to go that route, that is ok as well, but obviously we'd like to help you contribute to the community itself if possible.

- File based versus provided configuration:
Consider allowing the user to enter/paste a pipeline configuration directly into the property as an alternative to relying on a file reference. Having the configuration entered directly greatly eases the burden on an administrator, who would otherwise have to put that config somewhere on all systems in a NiFi cluster, and it also means users can easily tweak their pipelines through the web UI.

- Provide a sample configuration/template using it:
It would be awesome if you could write a blog or something that shows this thing in all its glory: how to set it up, sample data, a pipeline, and the results. That would be very helpful.

- Handling of 'original' flowfile:
Consider having an 'original' relationship to which you route the original flowfile, rather than removing it in the session, if all goes well. We've found that folks often like to use that relationship after the processing is successful, or they can just terminate it. But it gives them the control.

- Memory management:
Can you describe the memory management aspects of this processor? Will it load the original document in memory fully, and will it have all outputs in memory at once? This is a common challenge with XML stuff. This will need to be well described on the processor so users can carefully consider how many instances/threads/etc. to use.

I noticed you did a really nice job of accounting for flowfiles and ensuring provenance would work here. Nice job!

Thanks
Joe

On Tue, Mar 7, 2017 at 10:17 AM, Steve Lawrence <[email protected]> wrote:
> We have developed a NiFi processor that uses XMLCalabash [1] to add
> support for XProc [2] processing. XProc is an XML transformation
> language that defines an XML pipeline, allowing for complex validation,
> transformation, and routing of XML data within the pipeline, using
> existing XML technologies such as RelaxNG, Schematron, XSD Schema,
> XQuery, XSLT, XPath and custom XProc transformations.
>
> This new processor is mostly straightforward, but we had some questions
> regarding the specific implementation and the handling of
> non-thread-safe code. The code is available for viewing here:
>
> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/nifi-xproc/browse
>
> In this processor, a property is created to provide an XProc file, which
> defines the pipeline input and output "ports". XML goes into an input
> port, goes through the pipeline, and one or more XML documents exit at
> specified output ports. This NiFi processor maps each output port to a
> dynamic NiFi relationship.
> It does this mapping in the onPropertyModified method when the XProc
> file property is changed. This method also stores the XMLCalabash
> XRuntime and XPipeline objects (which do all the pipeline work) in
> volatile member variables to be used later in onTrigger. The members
> are saved here to avoid recreating them in each call to onTrigger. Is
> this an acceptable place to do that? It seems this normally happens in
> an @OnScheduled method or in the first call to onTrigger, however the
> objects must be created in onPropertyModified to get the output ports,
> so this does avoid recreating the same objects multiple times. Also
> note that the same objects are created in the XML_PIPELINE_VALIDATOR
> but are not saved due to the validator being static, so there is
> already some duplication. Is there a standard way to avoid
> duplication/is this an acceptable way to handle this?
>
> The other concern we have is that the XPipeline and XRuntime objects
> created by XML Calabash are not thread safe. To resolve this issue, the
> processor is annotated with @TriggerSerially. Is this the correct
> solution, or is there some other preferred method? Perhaps ThreadLocal
> or a thread safe pool of XPipeline objects is preferred?
>
> Lastly, is this something the devs would be interested in pulling into
> NiFi, and if not, what could be changed to achieve this? The code is
> licensed as Apache v2 and we would be happy to contribute the code to
> NiFi if deemed acceptable.
>
> Thanks,
> - Steve
>
> [1] http://xmlcalabash.com/
> [2] https://www.w3.org/TR/xproc/
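
On the thread safety question, the "small pool of lazily initialized objects" option could look roughly like the sketch below. It is plain Java with no NiFi or XMLCalabash dependency. The class name PipelinePool and its methods are hypothetical, and the factory that would actually build an XPipeline is left to the caller, since that construction code is not shown in this thread.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: a small pool for objects that are expensive to
 * build and not thread safe. Each caller borrows one instance for the
 * duration of its work, so no instance is shared between threads.
 */
final class PipelinePool<T> {
    // Instances that are currently idle and safe to hand out.
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    // Builds a fresh instance when no idle one is available.
    private final Supplier<T> factory;
    // Number of instances built so far (useful for diagnostics and tests).
    private final AtomicInteger created = new AtomicInteger();

    PipelinePool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Borrows an instance (creating one lazily), runs the work, returns it. */
    <R> R withInstance(Function<T, R> work) {
        T instance = idle.poll();
        if (instance == null) {
            instance = factory.get();
            created.incrementAndGet();
        }
        boolean succeeded = false;
        try {
            R result = work.apply(instance);
            succeeded = true;
            return result;
        } finally {
            // An instance whose work threw may be in a bad state, so it
            // is dropped instead of being returned to the pool.
            if (succeeded) {
                idle.offer(instance);
            }
        }
    }

    int createdCount() {
        return created.get();
    }

    /** Drops all idle instances, e.g. when the configuration changes. */
    void clear() {
        idle.clear();
    }
}
```

With something like this, onTrigger would wrap its pipeline work in withInstance, and the pool would be cleared when the XProc property changes or from a lifecycle hook such as an @OnStopped method. The pool grows to at most the number of concurrent onTrigger calls, which also gives a rough upper bound on memory held by cached pipelines.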
