We are using Tika Server to parse an HTML document:

<html><title>hi there</title><body>woah</body></html>

When called through the endpoint

http://localhost:49309/rmeta/text

the parser returns a basic result like this:

[
  {
    "Content-Encoding": "ISO-8859-1",
    "Content-Type": "text/html; charset=ISO-8859-1",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.html.HtmlParser"
    ],
    "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
    "X-TIKA:content_handler": "ToTextContentHandler",
    "X-TIKA:embedded_depth": "0",
    "X-TIKA:parse_time_millis": "284",
    "dc:title": "hi there",
    "title": "hi there"
  }
]

Notice that the title text ("hi there") also appears in the X-TIKA:content
field, alongside the body text.

When using Tika embedded in a Java app, I know that if you extend Tika's
default handler you can customize how XHTML elements such as <title> are
handled so that, for example, the content field does not include the
title.
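
To illustrate what I mean by the embedded approach, here is a minimal sketch
of the idea. It uses only the JDK's SAX classes rather than Tika's own handler
classes, and the class name TitleSkippingHandler and its extractText helper
are made up for this example. It assumes the input is well-formed markup; in a
real Tika setup the HtmlParser would be the one producing the SAX events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data but drops anything
// that appears inside a <title> element.
class TitleSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int titleDepth = 0;

    // Match on either name, since localName is only populated when the
    // parser is namespace-aware.
    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(localName) || "title".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isTitle(localName, qName)) {
            titleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTitle(localName, qName)) {
            titleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside of <title>.
        if (titleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Convenience helper for this sketch: parse well-formed markup with
    // the JDK SAX parser and return the collected text.
    static String extractText(String xhtml) throws Exception {
        TitleSkippingHandler handler = new TitleSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }
}
```

Calling TitleSkippingHandler.extractText on the sample document above yields
only the body text, without "hi there".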

Does anyone know whether something similar is possible when using Tika
Server?
