[jira] [Commented] (TIKA-2755) Allow Tika to skip extraction of <img> tags in HTML

Tim Allison (JIRA) Wed, 17 Oct 2018 06:10:18 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653517#comment-16653517
 ]


Tim Allison commented on TIKA-2755:
-----------------------------------

Doh. My fault, not yours.  tika-server uses the BoilerpipeContentHandler for 
the /tika endpoint.  As you observe, this handler includes the markup.

The /rmeta/text endpoint uses the ToTextHandler and returns the content without 
the markup.
{noformat}
curl -T TestForImageTag.html http://localhost:9998/rmeta/text
[{"Content-Encoding":"windows-1252","Content-Type":"text/html; charset\u003dwindows-1252","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThis is a test\n","X-TIKA:parse_time_millis":"9"}]
{noformat}

The downside is that you then have to parse the JSON and extract the content.
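
For what it's worth, the client-side parsing can stay small. Below is a minimal Python sketch, not anything shipped with Tika: the helper names {{extract_content}} and {{fetch_text}} are hypothetical, and it assumes the first element of the /rmeta/text array is the container document and that its text lives under "X-TIKA:content", as in the output above.

```python
import json
import urllib.request


def extract_content(rmeta_json: str) -> str:
    """Return the text of the first document in an /rmeta/text response.

    Assumption: the response is a JSON array of metadata objects and the
    first element describes the container document.
    """
    documents = json.loads(rmeta_json)
    if not documents:
        return ""
    # Strip the leading/trailing newlines the handler emits around the text.
    return documents[0].get("X-TIKA:content", "").strip()


def fetch_text(path: str, url: str = "http://localhost:9998/rmeta/text") -> str:
    """PUT a local file to tika-server (like `curl -T`) and return its text.

    The default URL mirrors the curl example above; adjust it for your server.
    """
    with open(path, "rb") as handle:
        request = urllib.request.Request(url, data=handle.read(), method="PUT")
    with urllib.request.urlopen(request) as response:
        return extract_content(response.read().decode("utf-8"))
```

With a server running locally, {{fetch_text("TestForImageTag.html")}} should give back only the text, without the JSON wrapper.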

Fellow devs, any idea why we use the BoilerpipeContentHandler in {{/tika}} and 
not the ToTextHandler?


> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
>                 Key: TIKA-2755
>                 URL: https://issues.apache.org/jira/browse/TIKA-2755
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.19.1
>            Reporter: Harinder
>            Priority: Major
>         Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the 
> alt text of image tags present in HTML files as _[image: this is the alt text 
> of the image]_. This ends up in Solr and shows up in the results when we 
> generate document summaries at query time (via Solr’s highlight 
> functionality).
> If you PUT the attached HTML file to /tika, it will return the following 
> response
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead
> {code:java}
> This is a test {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
