When using Tika Server to parse an HTML document such as:

    <html><title>hi there</title><body>woah</body></html>

calling the parser through the endpoint http://localhost:49309/rmeta/text gives a basic result like this:

    [
      {
        "Content-Encoding": "ISO-8859-1",
        "Content-Type": "text/html; charset=ISO-8859-1",
        "X-Parsed-By": [
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"
        ],
        "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
        "X-TIKA:content_handler": "ToTextContentHandler",
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "284",
        "dc:title": "hi there",
        "title": "hi there"
      }
    ]

Notice how the title ("hi there") appears in the body content, "X-TIKA:content". When using Tika embedded in a Java app, I know that if you extend Tika's default handler you can customize how XHTML elements such as <title> are handled, so that, for example, the content field does not include the title. Does anyone know whether something similar is possible when using Tika Server?
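For reference, here is a minimal sketch of the kind of embedded-mode customization I mean. It is not Tika's own handler: `TitleSkippingTextHandler` and its `extract` helper are hypothetical names, and the sketch uses only the JDK's SAX classes so it runs without Tika on the classpath. It assumes the parser reports the title as an element named `title` in its SAX events.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (illustrative name, not part of Tika):
// collects character data but drops anything inside a <title> element.
class TitleSkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int titleDepth = 0; // > 0 while we are inside <title>

    // Match on either name, since localName is empty when the
    // SAX source is not namespace-aware.
    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(localName) || "title".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isTitle(localName, qName)) {
            titleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTitle(localName, qName)) {
            titleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every <title> element.
        if (titleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Demo helper: drives the handler with the JDK's SAX parser.
    // The input must be well-formed XML/XHTML for this helper to work.
    static String extract(String xhtml) throws Exception {
        TitleSkippingTextHandler handler = new TitleSkippingTextHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.toString();
    }
}
```

In an embedded setup, a handler like this would be passed to Tika's `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)` in place of the default text handler. What I can't see is how to plug the equivalent into Tika Server.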
