I don’t think handler customization is generally possible in tika-server.
What happens with /rmeta/body?

On Wed, Jun 23, 2021 at 2:27 PM Nicholas DiPiazza <[email protected]> wrote:

> When we are using the Tika-Server and parsing an HTML document
>
> <html><title>hi there</title><body>woah</body></html>
>
> The parser when called through the endpoint:
>
> http://localhost:49309/rmeta/text
>
> Will give you a basic result like this:
>
> [
>   {
>     "Content-Encoding": "ISO-8859-1",
>     "Content-Type": "text/html; charset=ISO-8859-1",
>     "X-Parsed-By": [
>       "org.apache.tika.parser.DefaultParser",
>       "org.apache.tika.parser.html.HtmlParser"
>     ],
>     "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
>     "X-TIKA:content_handler": "ToTextContentHandler",
>     "X-TIKA:embedded_depth": "0",
>     "X-TIKA:parse_time_millis": "284",
>     "dc:title": "hi there",
>     "title": "hi there"
>   }
> ]
>
> Notice how the title is in the body content.
>
> When using Tika embedded in a Java app, I know if you extend Tika's default
> handler you can customize the XHTML attributes such as <title> so that you
> could, for example, make it so that the content field does not have the
> title in it.
>
> Does anyone know when using Tika Server if there is a similar thing
> possible?
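
For reference, the embedded approach described in the quoted message can be sketched roughly as below. This is only an illustration, not Tika's own handler: it uses the plain JDK SAX classes, and the class name TitleSkippingHandler and the helper extractText are made up for the example. Since Tika's parse methods accept an org.xml.sax.ContentHandler, a handler along these lines could presumably be passed to an embedded Tika parser in place of the default one, but that wiring is an assumption and is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data but ignores anything inside <title>.
public class TitleSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int titleDepth = 0; // greater than 0 while inside a <title> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isTitle(localName, qName)) {
            titleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTitle(localName, qName)) {
            titleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are not inside <title>.
        if (titleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Check both names so this works with and without namespace awareness.
    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(localName) || "title".equalsIgnoreCase(qName);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Hypothetical helper: runs the handler over a well-formed XHTML string
    // using the JDK's SAX parser (standing in for a Tika parser here).
    static String extractText(String xhtml) throws Exception {
        TitleSkippingHandler handler = new TitleSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Same document as in the quoted message; prints "woah".
        System.out.println(extractText("<html><title>hi there</title><body>woah</body></html>"));
    }
}
```

None of this changes what tika-server returns; it only shows the kind of customization that is available when the handler is under your own control.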
