Simone Tripodi wrote:
Hi Sylvain and Simone,
thanks a lot, the suggestions you provided are all very interesting.
I now wonder whether it is possible to build a processor that uses
the Tika approach when it recognizes the kinds of paths Tika supports,
and "XSL-on-the-fly" for more complex cases. What do you think?
As I suggested previously: first try to parse the XPath expression with
Tika's parser, and if it fails because the expression doesn't match the
subset it accepts, fall back to XSL-on-the-fly.
Looking at Tika's parser [1], it looks like you'll have to override the
parse() method so that it fails hard by throwing an exception rather
than returning Matcher.FAIL. That is what lets you detect XPath features
outside of the subset it accepts.
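
To make the fallback concrete, here is a minimal sketch of the "try the strict parser, fall back to XSL" decision. Everything in it is hypothetical: StrictXPathParser, UnsupportedXPathException and the regex-based TOY_PARSER are stand-ins I made up for illustration. In real use the parser would be the Tika parser with parse() changed to throw, and the subset it accepts would be Tika's, not this regex.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: none of these types exist in Tika or Cocoon.
class XPathStrategyDemo {

    /** Signals that an expression is outside the subset the strict parser accepts. */
    static class UnsupportedXPathException extends Exception {
        UnsupportedXPathException(String xpath) {
            super("Outside streaming subset: " + xpath);
        }
    }

    /** Stand-in for a parser that throws instead of returning a "fail" matcher. */
    interface StrictXPathParser {
        Object parse(String xpath) throws UnsupportedXPathException;
    }

    enum Strategy { STREAMING, XSLT }

    // Assumed toy subset: plain child steps, optionally ending in text() or @attr.
    static final Pattern SIMPLE = Pattern.compile(
            "(/[A-Za-z_][A-Za-z0-9_.-]*)+(/text\\(\\)|/@[A-Za-z_][A-Za-z0-9_.-]*)?");

    static final StrictXPathParser TOY_PARSER = xpath -> {
        if (!SIMPLE.matcher(xpath).matches()) {
            throw new UnsupportedXPathException(xpath);
        }
        return xpath; // a real parser would return a matcher object here
    };

    /** Try the strict parser first; any rejection means we fall back to XSLT. */
    static Strategy choose(String xpath, StrictXPathParser parser) {
        try {
            parser.parse(xpath);
            return Strategy.STREAMING;
        } catch (UnsupportedXPathException e) {
            return Strategy.XSLT;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose("/html/body/text()", TOY_PARSER));
        System.out.println(choose("//p[position() > 2]", TOY_PARSER));
    }
}
```

The point of the exception is that the caller can tell "this expression is unsupported" apart from "this expression matched nothing", which a silent fail value does not allow.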
Sylvain, I still haven't read the Tika documentation. Can you point me
to the related docs on this topic?
There's no specific documentation on this particular feature, as it's
more an internal utility than a primary feature in Tika. That said, the
code is pretty straightforward.
Simo, have you already tried generating the XSLT on the fly?
The most basic approach I can think of is generating the XSL string from
a template, then passing it to the XSL parser, but I'm sure it could be
implemented in a better way :P
Sounds like the way to go, but you should cache the resulting template
object to avoid recreating and reparsing the XSL at every request. The
same applies to Tika matcher objects.
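
As a rough sketch of the caching idea, assuming the standard JAXP API (javax.xml.transform) and a made-up stylesheet template: the XslCache class, its method names and the <result> wrapper element are all hypothetical. The compiled Templates object is thread-safe and reusable, so it is what gets cached, keyed by the XPath string. A Transformer is cheap to create from it and is created per use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: class name, template and wrapper element are illustrative.
class XslCache {

    // Assumed stylesheet shape: copy whatever the XPath selects into <result>.
    private static final String XSL_TEMPLATE =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          + "<xsl:template match=\"/\">"
          + "<result><xsl:copy-of select=\"%s\"/></result>"
          + "</xsl:template></xsl:stylesheet>";

    private final Map<String, Templates> cache = new ConcurrentHashMap<>();

    /** Returns the compiled stylesheet for this XPath, compiling it only once. */
    Templates templatesFor(String xpath) {
        return cache.computeIfAbsent(xpath, XslCache::compile);
    }

    private static Templates compile(String xpath) {
        // Escape characters that would break the select="..." attribute.
        String escaped = xpath.replace("&", "&amp;")
                              .replace("<", "&lt;")
                              .replace("\"", "&quot;");
        String xsl = String.format(XSL_TEMPLATE, escaped);
        try {
            // A fresh factory per compile, since factories are not guaranteed thread-safe.
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xsl)));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Cannot compile XPath: " + xpath, e);
        }
    }

    /** Applies the cached stylesheet for this XPath to an XML string. */
    String apply(String xpath, String xml) {
        StringWriter out = new StringWriter();
        try {
            templatesFor(xpath).newTransformer()
                    .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        XslCache cache = new XslCache();
        System.out.println(cache.apply("/a/b", "<a><b>x</b></a>"));
    }
}
```

The same map-based approach would work for the matcher objects, with the XPath string as the key.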
Sylvain
[1]
https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/XPathParser.java
--
Sylvain Wallez - http://bluxte.net