Simone Tripodi wrote:
Hi Sylvain and Simone,
thank you a lot, the suggestions you provided are all very very
interesting, so I wonder now if it is possible to realize a processor
able to use at the same time the Tika way when it recognizes some kind
of paths, the "XSL-on-the-fly" for more complex cases. What do you
think?

As I suggested previously: first try to parse the XPath expression with Tika's parser, and if it fails because the expression doesn't match the subset it accepts, fall back to XSL-on-the-fly.

Looking at Tika's parser [1], it looks like you'll have to override the parse() method so that it fails hard by throwing an exception rather than returning Matcher.FAIL. Otherwise you won't be able to detect XPath features outside of the subset it accepts.
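To make the idea concrete, here is a minimal sketch of the "try the fast parser, fall back to XSL" decision. It does not use Tika's classes: parseStrict() is a hypothetical stand-in that accepts only plain child steps and text(), and UnsupportedXPathException and chooseStrategy() are names made up for illustration. The point is only that a recursive parser which throws on the first unsupported step lets the caller detect the failure reliably.

```java
public class XPathStrategyChooser {

    /** Hypothetical exception: the expression is outside the fast subset. */
    static class UnsupportedXPathException extends RuntimeException {
        UnsupportedXPathException(String xpath) {
            super("Unsupported XPath: " + xpath);
        }
    }

    /**
     * Stand-in for a strict streaming parser (NOT Tika's implementation).
     * Accepts only simple child steps such as /a/b and /a/b/text(), and
     * throws on anything else instead of returning a "fail" marker that
     * could end up buried inside a composite result.
     */
    public static String parseStrict(String xpath) {
        if (xpath.isEmpty()) {
            return "element";
        }
        if (xpath.equals("/text()")) {
            return "text";
        }
        if (xpath.startsWith("/") && !xpath.startsWith("//")) {
            int end = xpath.indexOf('/', 1);
            if (end == -1) {
                end = xpath.length();
            }
            String name = xpath.substring(1, end);
            if (name.matches("[A-Za-z_][A-Za-z0-9_.-]*")) {
                // Recurse on the remainder; an unsupported step throws.
                return "child(" + name + ")/" + parseStrict(xpath.substring(end));
            }
        }
        throw new UnsupportedXPathException(xpath);
    }

    /** Picks the streaming path when possible, XSL-on-the-fly otherwise. */
    public static String chooseStrategy(String xpath) {
        try {
            parseStrict(xpath);
            return "streaming";
        } catch (UnsupportedXPathException e) {
            return "xslt";
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseStrategy("/html/body/text()"));   // prints "streaming"
        System.out.println(chooseStrategy("/html/body[@id='x']")); // prints "xslt"
    }
}
```

With the real Tika parser, the equivalent would be a subclass whose parse() throws where the original returns Matcher.FAIL.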

Sylvain, I still haven't read the Tika documentation, can you just
point me the related doc about this topic?

There's no specific documentation on this particular feature, as it's more an internal utility than a primary feature of Tika. That said, the code is pretty straightforward.

Simo, did you already give a try about the XSLT generation on the fly?
The most basic operation I thought is generating the XSL string by a
template, then pass it to the XSL parser, but I'm sure it could be
implemented in a better way :P

Sounds like the way to go, but you should cache the resulting template object to avoid recreating and reparsing the XSL at every request. The same applies to Tika matcher objects.
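As a rough sketch of the caching part, using only the standard javax.xml.transform API: the stylesheet is generated from a string skeleton, compiled once into a Templates object, and kept in a map keyed by the XPath expression. The class name XslTemplateCache, the XSL_SKELETON string and the evaluate() helper are assumptions made for illustration, not code from your project. Templates objects are thread-safe and can be shared, while each request needs its own Transformer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslTemplateCache {

    // Hypothetical stylesheet skeleton; %s is replaced by the XPath expression.
    private static final String XSL_SKELETON =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:value-of select=\"%s\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Map<String, Templates> cache = new ConcurrentHashMap<>();

    /** Returns the compiled stylesheet for this XPath, compiling it only once. */
    public Templates templatesFor(String xpath) {
        return cache.computeIfAbsent(xpath, this::compile);
    }

    private Templates compile(String xpath) {
        String xsl = String.format(XSL_SKELETON, escapeAttribute(xpath));
        // TransformerFactory is not guaranteed to be thread-safe, so guard it.
        synchronized (factory) {
            try {
                return factory.newTemplates(new StreamSource(new StringReader(xsl)));
            } catch (TransformerException e) {
                throw new IllegalArgumentException("Cannot compile XSL for: " + xpath, e);
            }
        }
    }

    /** Escapes the characters that would break a double-quoted XML attribute. */
    private static String escapeAttribute(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    /** Applies the cached stylesheet to an XML string and returns the text output. */
    public String evaluate(String xpath, String xml) {
        StringWriter out = new StringWriter();
        try {
            // A Transformer is cheap to create from Templates and must not be shared.
            templatesFor(xpath).newTransformer()
                .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed for: " + xpath, e);
        }
        return out.toString();
    }
}
```

Calling evaluate("/a/b", "<a><b>hi</b></a>") twice compiles the stylesheet only on the first call. In a long-running service you would probably want to bound the map or evict old entries; that is left out here.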

Sylvain

[1] https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/XPathParser.java

--
Sylvain Wallez - http://bluxte.net
