I don’t think handler customization is generally possible in tika-server.

What happens with /rmeta/body?
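
A quick way to try it is sketched below in Python, using only the standard library. It assumes the server is listening on localhost:49309 as in the quoted message, and that /rmeta/body accepts a PUT of the raw document the same way /rmeta/text does. The function names are made up for illustration and are not part of Tika. Whether the body handler actually keeps the title out of X-TIKA:content is exactly what the request is meant to find out.

```python
import json
import urllib.request


def build_rmeta_request(base_url, html_bytes, handler="body"):
    """Build a PUT request for tika-server's /rmeta/<handler> endpoint.

    The handler name ("body", "text", ...) selects the content handler;
    "body" is the one suggested above.
    """
    url = f"{base_url.rstrip('/')}/rmeta/{handler}"
    return urllib.request.Request(
        url,
        data=html_bytes,
        method="PUT",
        headers={"Content-Type": "text/html", "Accept": "application/json"},
    )


def extract_content(rmeta_json):
    """Return X-TIKA:content from the first record of an /rmeta response."""
    records = json.loads(rmeta_json)
    return records[0].get("X-TIKA:content", "")


def fetch_content(base_url, html_bytes, handler="body"):
    """Send the document to a running server (assumed reachable) and return its content."""
    request = build_rmeta_request(base_url, html_bytes, handler)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_content(response.read().decode("utf-8"))


# Example, with a server running on the port from the quoted message:
#   html = b"<html><title>hi there</title><body>woah</body></html>"
#   print(repr(fetch_content("http://localhost:49309", html, "body")))
#   print(repr(fetch_content("http://localhost:49309", html, "text")))
```

Comparing the "body" and "text" results side by side should show whether the title still ends up in the content field.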

On Wed, Jun 23, 2021 at 2:27 PM Nicholas DiPiazza <
[email protected]> wrote:

> When we are using Tika Server to parse an HTML document such as:
>
> <html><title>hi there</title><body>woah</body></html>
>
> The parser, when called through the endpoint:
>
> http://localhost:49309/rmeta/text
>
> will give you a basic result like this:
>
> [
> {
> "Content-Encoding": "ISO-8859-1",
> "Content-Type": "text/html; charset=ISO-8859-1",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.html.HtmlParser"
> ],
> "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
> "X-TIKA:content_handler": "ToTextContentHandler",
> "X-TIKA:embedded_depth": "0",
> "X-TIKA:parse_time_millis": "284",
> "dc:title": "hi there",
> "title": "hi there"
> }
> ]
>
> Notice how the title ("hi there") appears in the X-TIKA:content field.
>
> When using Tika embedded in a Java app, I know that if you extend Tika's
> default handler you can customize how XHTML elements such as <title> are
> handled, so that, for example, the content field does not include the
> title.
>
> Does anyone know whether something similar is possible when using Tika
> Server?
>
