[ 
https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650682#comment-16650682
 ] 

Harinder edited comment on TIKA-2755 at 10/15/18 7:47 PM:
----------------------------------------------------------

I had seen that, except I missed the rather important Accept header.

Now, I must be doing something very silly. Even with the "Accept: text/plain" 
header, I am still seeing [image: ...] tags:
{code:java}
$ curl -T TestForImageTag.html http://localhost:9998/tika --header "Accept: text/plain"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140    0    48  100    92    123    235 --:--:-- --:--:-- --:--:--   358
[image: Return to the homepage]
This is a test

{code}


was (Author: hanjan):
I had seen that, except I missed the rather important Accept header.

Now, I must be doing something very silly. Even with the "Accept: text/plain" 
header, I am still seeing [image: ...] tags:
{noformat}
$ curl -T TestForImageTag.html http://localhost:9998/tika --header "Accept: text/plain"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140    0    48  100    92    123    235 --:--:-- --:--:-- --:--:--   358
[image: Return to the homepage]
This is a test

{noformat}

> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
>                 Key: TIKA-2755
>                 URL: https://issues.apache.org/jira/browse/TIKA-2755
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.19.1
>            Reporter: Harinder
>            Priority: Major
>         Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the 
> alt text of image tags present in HTML files as _[image: this is the alt text 
> of the image]_. This ends up in Solr and shows up in the results when we 
> generate document summaries at query time (via Solr’s highlight 
> functionality).
> If you PUT the attached HTML file to /tika, it returns the following 
> response:
> {code:java}
> [image: Return to the homepage]
> This is a test
> {code}
> It would be nice to have just this instead:
> {code:java}
> This is a test
> {code}
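
Until the server can skip <img> tags itself, one possible client-side workaround is to strip the placeholders from the text that /tika returns before it is sent to Solr. The sketch below is only an illustration, not part of Tika: the function name strip_image_placeholders is hypothetical, and it assumes every placeholder has the "[image: alt text]" form shown in the response above, with no "]" inside the alt text.

```python
import re

# Assumption: placeholders look like "[image: alt text]", as in the response
# quoted in this issue, and the alt text contains no "]" character.
_IMAGE_PLACEHOLDER = re.compile(r"\[image:[^\]\n]*\]")


def strip_image_placeholders(text: str) -> str:
    """Remove "[image: ...]" placeholders from extracted text.

    Lines that contained nothing but a placeholder are dropped; lines
    without a placeholder are kept exactly as they were.
    """
    kept = []
    for line in text.splitlines():
        cleaned = _IMAGE_PLACEHOLDER.sub("", line)
        if cleaned == line:
            # No placeholder on this line: keep it untouched.
            kept.append(line)
            continue
        cleaned = cleaned.strip()
        if cleaned:
            # Some real text remained next to the placeholder.
            kept.append(cleaned)
    return "\n".join(kept)


response_text = "[image: Return to the homepage]\nThis is a test\n"
print(strip_image_placeholders(response_text))  # This is a test
```

Note that this also removes any literal "[image: ...]" text that happens to appear in the document body, so it is a stopgap rather than a substitute for the requested option.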



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
