Hi Sylvain and Simone,
thank you very much, the suggestions you provided are all very interesting. I now wonder whether it would be possible to build a processor that uses both approaches: the Tika way when it recognizes certain kinds of paths, and "XSL-on-the-fly" for more complex cases. What do you think?
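Just to make the "XSL-on-the-fly" part concrete, here is a rough sketch of what I have in mind, using only the standard JAXP API (javax.xml.transform). The class name XslOnTheFly, the stylesheet template and the two helper methods are made up for illustration; it takes a plain XPath (no real XPointer parsing, no namespace handling), so please read it as a starting point under those assumptions rather than a proposed implementation:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: builds a tiny stylesheet around an XPath and runs it.
public class XslOnTheFly {

    // Assumed template: %s is replaced by the (escaped) XPath to extract.
    private static final String TEMPLATE =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:copy-of select=\"%s\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Generates the XSL string from the template.
    public static String buildStylesheet(String xpath) {
        // Escape the characters that would break the select="..." attribute.
        String escaped = xpath.replace("&", "&amp;")
                              .replace("<", "&lt;")
                              .replace("\"", "&quot;");
        return String.format(TEMPLATE, escaped);
    }

    // Passes the generated stylesheet to whatever XSLT processor JAXP finds
    // and returns the selected fragment as a string.
    public static String extract(String xml, String xpath)
            throws TransformerException {
        StreamSource xsl = new StreamSource(new StringReader(buildStylesheet(xpath)));
        Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><a>one</a><b id=\"x\">two</b></doc>";
        // Copies only the <b> element to the output.
        System.out.println(extract(xml, "/doc/b"));
    }
}
```

In a real pipeline the StreamSource/StreamResult pair would presumably be replaced by SAX-based sources and results, and the compiled stylesheet could be cached per expression (for example as a javax.xml.transform.Templates object) instead of being rebuilt on every call.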
Sylvain, I haven't read the Tika documentation yet; could you point me to the docs on this topic?
Simo, have you already tried generating the XSLT on the fly? The most basic approach I can think of is generating the XSL string from a template and then passing it to the XSLT processor, but I'm sure it could be implemented in a better way :P

Any suggestion will be much appreciated, thanks in advance.
Best regards, have a nice evening!!!
Simone

On Mon, Nov 23, 2009 at 7:16 PM, Sylvain Wallez <[email protected]> wrote:
> Simone Gianni wrote:
>>
>> Hi Simone and Sylvain,
>> aren't XSLT transformers already SAX/Xpath optimized? I mean, an XSLT
>> containing an XPath expression and used in a SAX context, isn't already able
>> to resolve the XPath while keeping buffering at the minimum possible?
>>
>> I can clearly remember that there has been a lot of work about this in
>> Xalan and other XSLT engines, and also how a complex XPath expressions could
>> change the performance of a transformation because of increased buffering.
>
> Xalan has an optimized implementation of the document tree [1], more
> efficient than the standard DOM for read-only and selection operations.
> Xalan has an incremental processing mode, but IIRC it's more about being
> able to produce some output before the whole document has been read rather
> than avoiding to build parts of the document tree. So it will allow for
> faster processing, but won't change memory consumption.
>
>> In that case, maybe, instead of reinventing it, it should be possible to
>> delegate the "transformation" (extraction of a fragment from the entire XML
>> stream) to an XSLT processor. The simplest way could be to generate an XSLT
>> on the fly :) .. the correct way would be to use the [Xalan|Saxon|any other]
>> internal APIs to perform the XPath resolution. In both cases, it will be
>> faster than transforming to DOM.
>
> Agree. It may be easier to produce a small XSL transformation from the
> XPointer expression than using Axiom. But still, for simple expressions, the
> pure streaming approach used by Tika would be way more efficient.
>
> Sylvain
>
> [1] http://xml.apache.org/xalan-j/dtm.html
>
> --
> Sylvain Wallez - http://bluxte.net
>
>

--
http://www.google.com/profiles/simone.tripodi
