> is possible in tika-server

Currently, but this has been on my wishlist forever…

On Wed, Jun 23, 2021 at 2:35 PM Tim Allison <[email protected]> wrote:

> I don’t think handler customization generally is possible in Tika-server.
>
> What happens with /rmeta/body?
>
> On Wed, Jun 23, 2021 at 2:27 PM Nicholas DiPiazza <
> [email protected]> wrote:
>
>> When we use Tika Server to parse an HTML document such as:
>>
>> <html><title>hi there</title><body>woah</body></html>
>>
>> The parser, when called through the endpoint:
>>
>> http://localhost:49309/rmeta/text
>>
>> will give you a basic result like this:
>>
>> [
>> {
>> "Content-Encoding": "ISO-8859-1",
>> "Content-Type": "text/html; charset=ISO-8859-1",
>> "X-Parsed-By": [
>> "org.apache.tika.parser.DefaultParser",
>> "org.apache.tika.parser.html.HtmlParser"
>> ],
>> "X-TIKA:content": "\n\n\n\n\n\n\nhi there\n\nwoah",
>> "X-TIKA:content_handler": "ToTextContentHandler",
>> "X-TIKA:embedded_depth": "0",
>> "X-TIKA:parse_time_millis": "284",
>> "dc:title": "hi there",
>> "title": "hi there"
>> }
>> ]
>>
>> Notice how the title is in the body content.
>>
>> When using Tika embedded in a Java app, I know that if you extend Tika's
>> default handler you can customize how XHTML elements such as <title> are
>> handled, so that, for example, the content field does not have the title
>> in it.
>>
>> Does anyone know whether something similar is possible when using Tika
>> Server?
>>
>
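
For the embedded case described in the quoted question, the idea of a handler that keeps <title> text out of the content can be sketched roughly as below. This is only an illustration using the JDK's own SAX classes, not Tika's handler classes: the names TitleSkippingDemo, TitleSkippingHandler and extractText are made up for this example, and the JDK XML parser stands in for Tika so the snippet runs without extra dependencies. With Tika embedded, the assumption is that a handler like this would be passed as the ContentHandler argument to Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). None of this applies to tika-server, which is the point of the thread.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleSkippingDemo {

    /** Hypothetical handler: collects character data, ignoring anything inside <title>. */
    static class TitleSkippingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int titleDepth = 0; // > 0 while we are inside a <title> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equalsIgnoreCase(qName)) {
                titleDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(qName)) {
                titleDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only keep text that is not part of the title.
            if (titleDepth == 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser (a stand-in for Tika here). */
    static String extractText(String xhtml) throws Exception {
        TitleSkippingHandler handler = new TitleSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><title>hi there</title><body>woah</body></html>";
        System.out.println(extractText(xhtml)); // prints "woah"
    }
}
```

The title would still be available separately through the metadata (dc:title in the output above); the handler only controls what ends up in the extracted content.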
