> is possible in tika-server currently, but this has been on my wishlist forever…
On Wed, Jun 23, 2021 at 2:35 PM Tim Allison <[email protected]> wrote:

> I don’t think handler customization generally is possible in Tika-server.
>
> What happens with /rmeta/body?
>
> On Wed, Jun 23, 2021 at 2:27 PM Nicholas DiPiazza <
> [email protected]> wrote:
>
>> When we are using the Tika-Server and parsing an html
>>
>> <html><title>hi there</title><body>woah</body></html>
>>
>> The parser when called through the endpoint:
>>
>> http://localhost:49309/rmeta/text
>>
>> Will give you a basic result like this:
>>
>> [
>>   {
>>     "Content-Encoding": "ISO-8859-1",
>>     "Content-Type": "text/html; charset=ISO-8859-1",
>>     "X-Parsed-By": [
>>       "org.apache.tika.parser.DefaultParser",
>>       "org.apache.tika.parser.html.HtmlParser"
>>     ],
>>     "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
>>     "X-TIKA:content_handler": "ToTextContentHandler",
>>     "X-TIKA:embedded_depth": "0",
>>     "X-TIKA:parse_time_millis": "284",
>>     "dc:title": "hi there",
>>     "title": "hi there"
>>   }
>> ]
>>
>> Notice how the title is in the body content.
>>
>> When using tika embedded in a java app, I know if you extend Tika's
>> default handler you can customize the XHTML attributes such as <title>
>> so that you could, for example, make it so that the content field does
>> not have the title in it.
>>
>> Does anyone know when using Tika Server if there is a similar thing
>> possible?
>>
>
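
For reference, the embedded-Java approach mentioned in the quoted message can be sketched roughly as below. This is only an illustration, not Tika's own code: it uses the JDK's built-in SAX parser as a stand-in for the XHTML SAX events that Tika sends to a ContentHandler, and the class and method names (TitleSkippingDemo, TitleSkippingHandler, extract) are made up for this example. In an embedded Tika app the same element-tracking logic would presumably live in the handler you pass to the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleSkippingDemo {

    /** Collects character data but ignores anything nested inside <title>. */
    static class TitleSkippingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int titleDepth = 0; // > 0 while we are inside a <title> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equalsIgnoreCase(qName)) {
                titleDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(qName)) {
                titleDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only keep text that is not part of the title.
            if (titleDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Hypothetical helper: parses well-formed XHTML and returns the text minus the title. */
    static String extract(String xhtml) throws Exception {
        TitleSkippingHandler handler = new TitleSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText().trim();
    }

    public static void main(String[] args) throws Exception {
        // Same sample document as in the original question.
        System.out.println(extract("<html><title>hi there</title><body>woah</body></html>"));
    }
}
```

Run against the sample document from the question, this prints only "woah". The title would still be available separately through the metadata fields (dc:title, title), as in the JSON above.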
