Hi, I recently had a look at improving the XSLT, XQuery and XPath components in Camel.
For example, the first two of these components now support StAX as a Source, and prefer StAX/SAX over DOM. For StAX you will need to enable it using the allowStAX option (to stay backwards compatible).

The latter (XPath) does not support this, because its javax API is limited. Likewise, the XPath engine in the JDK does not support streaming, so we end up loading the content into a DOM in memory. This means that when people try to split a big XML file with XPath in Camel, they hit an OOME, or end up with a solution that eats memory and slows the system down. The workaround is to build a custom expression that iterates the file source in pieces and does the "XPath splitting" manually.

So I have enhanced the tokenizer language in Camel so it can do this for you. See the sections:
- stream based
- streaming big XML payloads using Tokenizer language
at http://camel.apache.org/splitter

The idea is that you provide a start and end token, and the tokenizer will then chop the payload by grabbing the content between those tokens. All in a streamed fashion, using java.util.Scanner from the JDK.

I added some unit tests in camel-core to simulate big data and to output performance numbers:
- TokenPairIteratorSplitChoicePerformanceTest
- XPathSplitChoicePerformanceTest

And in camel-saxon we have a unit test as well:
- XPathSplitChoicePerformanceTest

I noticed Saxon is faster than the JDK XPath engine, but they both eat up memory. I looked at Saxon: they are starting to support streaming, but only in their EE version (which you need to buy a license for), and the streaming seems to be XSLT specific at first (not XPath).

I also added INFO logging in the XPathBuilder so it logs once when it initializes the XPathFactory. This lets you know which factory is used:

INFO XPathBuilder - Created default XPathFactory com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f

For example, if you have Saxon on the classpath it may use that instead.
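To make the memory point concrete, here is a minimal sketch of what splitting with the JDK XPath engine boils down to (the class name XPathSplit and the /records/record expression are mine, for illustration, not code from Camel): the whole payload must be parsed into a DOM tree before the expression can even be evaluated, which is the step that eats memory on big files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSplit {

    // Count the <record> elements via XPath. Note the whole document is
    // parsed into an in-memory DOM first -- there is no streaming here.
    static int countRecords(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/records/record", doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\"/><record id=\"2\"/></records>";
        System.out.println(countRecords(xml));
    }
}
```

With a multi-GB payload that DocumentBuilder.parse() call is exactly where the OOME happens, regardless of how cheap the XPath expression itself is.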
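By contrast, the token-pair idea can be sketched in a few lines of plain JDK code. This is not Camel's actual implementation, just a minimal illustration (the TokenPairSplitter class and its split method are hypothetical names) of chopping a stream on literal start/end tokens with java.util.Scanner, so only one record is held in memory at a time:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenPairSplitter {

    // Grab every piece of content between startToken and endToken.
    // The end token is used as the Scanner delimiter, so the stream is
    // consumed one record at a time instead of being read fully into memory.
    static List<String> split(Reader in, String startToken, String endToken) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in).useDelimiter(Pattern.quote(endToken))) {
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                int pos = chunk.indexOf(startToken);
                if (pos >= 0) {
                    // Re-append the end token so each record is complete.
                    records.add(chunk.substring(pos) + endToken);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\">a</record>"
                   + "<record id=\"2\">b</record></records>";
        for (String record : split(new StringReader(xml), "<record", "</record>")) {
            System.out.println(record);
        }
    }
}
```

Note there is no XML parsing at all here, which is both the strength (constant memory, works on non-XML too) and the limitation (no real XPath semantics, only literal token matching).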
For example, splitting 40.000 elements using the JDK XPath engine:
- Processed file with 40000 elements in: 45.521 seconds (uses about 98mb)

And 40.000 elements with the tokenizer:
- Processed file with 40000 elements in: 47.291 seconds (uses about 6mb)

And 200.000 elements with the tokenizer:
- Processed file with 200000 elements in: 3 minutes (uses about 14mb)

I could not run the 200.000 element test with XPath, as it hit an OOME (unless I bumped up the JVM memory allocation a lot). So it is not really about speed, but about memory usage. The tokenizer has very low memory usage, whereas XPath will just keep eating memory. If the XML data were very big, only the tokenizer would be able to split the file.

The tokenizer is of course not using a real XPath expression, so you can only split by chopping out a "record" of your XML file. But if you structure your XML data as follows, then the tokenizer can handle it:

<records>
  <record id="1">
  </record>
  <record id="2">
  </record>
  <record id="3">
  </record>
  ....
  <record id="N">
  </record>
</records>

The tokenizer can also handle non-XML data, in case you have special START/END tokens for your records.

What about other XPath libraries? Yes, there are a few out there. Some are not so actively maintained (I guess some of the XML hype is over now), and others have a GPL license or another kind of license that prevents us from using them at Apache:
http://www.apache.org/legal/3party.html#define-thirdpartywork

--
Claus Ibsen
-----------------
FuseSource
Email: cib...@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/