[ https://issues.apache.org/jira/browse/NIFI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114267#comment-17114267 ]
Mark Payne commented on NIFI-7480:
----------------------------------

Quickly glancing at the processor, it looks like the processor's documentation is incorrect. The documentation claims that the entire XML document is loaded into memory as a DOM object. However, this is not the case. The XML is parsed using a SAX (streaming) parser, so it does not need to load the entire document into memory. The documentation should be fixed.

That said, the processor can generate a large number of FlowFiles, which can also take a huge amount of memory, so a two-phase approach may be necessary if splitting at a level deeper than 1.

However, it is generally best to avoid splitting XML documents and instead use Record-based processors if at all possible. Splitting the data apart puts dramatically more stress on the NiFi framework, and as a result Record-based processors tend to perform about 10x better.

> Allow SplitXML processor to generate XML fragments without loading entire XML into memory
> -----------------------------------------------------------------------------------------
>
>                 Key: NIFI-7480
>                 URL: https://issues.apache.org/jira/browse/NIFI-7480
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Swarup Karavadi
>            Priority: Minor
>
> The current behaviour of the SplitXML processor (as documented
> [here|[http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.4/org.apache.nifi.processors.standard.SplitXml/index.html]])
> is to load the entire XML file in memory and then split the document into
> fragments. This can get very memory intensive when processing large files.
> I was wondering if it is possible to stream the file and construct XML
> fragments (based on split depth). I understand there might be some issues
> around this -
> * setting the fragment.count attribute for the flow file containing the XML
> fragment
> * recovering from failures (i.e., at what point during the processing should
> checkpoints be committed, etc.)
> Thought it was worth bringing this up to see if this is something worth
> picking up or even possible at all on the NiFi platform.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
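
For illustration only, the idea of splitting at a given depth with a streaming parser can be sketched in Python. This is not the SplitXml processor's Java code; the function name `split_xml` and the sample document are hypothetical, and the sketch assumes the root element is depth 0. It uses the standard library's `xml.etree.ElementTree.iterparse`, which reads the input incrementally, and it detaches each fragment from its parent once emitted so that only the current fragment and its ancestors stay in memory.

```python
import io
import xml.etree.ElementTree as ET


def split_xml(stream, depth=1):
    """Yield one serialized fragment per element at `depth` (root is depth 0).

    Hypothetical helper for illustration; not NiFi's implementation.
    """
    ancestors = []  # open elements, outermost first
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            ancestors.append(elem)
            continue
        ancestors.pop()  # elem has just been closed
        if len(ancestors) == depth:
            elem.tail = None  # drop whitespace that follows the end tag
            yield ET.tostring(elem, encoding="unicode")
            if ancestors:
                # Detach the emitted fragment so it can be garbage collected.
                ancestors[-1].remove(elem)


sample = b'<root><item id="1"><name>a</name></item><item id="2"/></root>'
fragments = list(split_xml(io.BytesIO(sample), depth=1))
```

Because `split_xml` is a generator, a caller that handles fragments one at a time keeps memory bounded. Collecting every fragment at once, as the `list(...)` call does here, grows with the number of fragments, which mirrors the concern above about generating many FlowFiles.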