[jira] [Commented] (NIFI-7480) Allow SplitXML processor to generate XML fragments without loading entire XML into memory

Mark Payne (Jira) Fri, 22 May 2020 10:27:23 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114267#comment-17114267
 ]


Mark Payne commented on NIFI-7480:
----------------------------------

Quickly glancing at the processor, it looks like the processor's documentation 
is incorrect. The documentation claims that the entire XML document is loaded 
into memory as a DOM object. However, this is not the case. The XML is parsed 
using a SAX (streaming) parser, so it does not need to load the entire document 
into memory. The documentation should be fixed.
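
To illustrate the difference, here is a minimal sketch of SAX-based splitting 
in plain Java. It is not the SplitXml source: the class name 
StreamingSplitSketch and the split() helper are hypothetical, attributes and 
namespaces are dropped, and text is not re-escaped. It only shows that a 
streaming handler needs to hold one fragment at a time, rather than a DOM of 
the whole document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the NiFi SplitXml implementation.
class StreamingSplitSketch {

    // Returns one string per element found at splitDepth (root = depth 0,
    // children of the root = depth 1). Attributes are intentionally ignored
    // and character data is copied without re-escaping.
    static List<String> split(String xml, int splitDepth) throws Exception {
        List<String> fragments = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;          // depth of the next element to open
            private StringBuilder current;  // fragment being built, or null

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (depth == splitDepth) {
                    current = new StringBuilder(); // a new fragment starts here
                }
                if (depth >= splitDepth) {
                    current.append('<').append(qName).append('>');
                }
                depth++;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
                if (depth >= splitDepth) {
                    current.append("</").append(qName).append('>');
                }
                if (depth == splitDepth) {
                    fragments.add(current.toString()); // fragment is complete
                    current = null;                    // release it
                }
            }
        };

        // The parser pushes events to the handler as it reads the stream.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return fragments;
    }
}
```

Note that this sketch still collects every fragment in a list before 
returning, so the output as a whole is held in memory even though the input 
document is not.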

That said, the processor can generate a very large number of FlowFiles, and 
those can themselves consume a huge amount of memory, so a 2-phase approach 
may be necessary when splitting at a depth greater than 1.

However, it is generally best to avoid splitting XML documents and to use 
Record-based processors instead if at all possible. Splitting the data apart 
puts dramatically more stress on the NiFi framework, and as a result 
Record-based processors tend to perform about 10x better.

> Allow SplitXML processor to generate XML fragments without loading entire XML 
> into memory
> -----------------------------------------------------------------------------------------
>
>                 Key: NIFI-7480
>                 URL: https://issues.apache.org/jira/browse/NIFI-7480
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Swarup Karavadi
>            Priority: Minor
>
> The current behaviour of the SplitXML processor (as documented 
> [here|http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.4/org.apache.nifi.processors.standard.SplitXml/index.html])
>  is to load the entire XML file in memory and then split the document into 
> fragments. This can get very memory intensive when processing large files. 
> I was wondering if it is possible to stream the file and construct XML 
> fragments (based on split depth). I understand there might be some issues 
> around this - 
>  * setting the fragment.count attribute for the flow file containing the XML 
> fragment
>  * recovering from failures (i.e., at what point during the processing should 
> checkpoints be committed, etc.)
> Thought it was worth bringing this up to see if this is something worth 
> picking up or even possible at all on the NiFi platform.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
