[
https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773660#comment-16773660
]
Finn Woelm commented on TIKA-2755:
----------------------------------
Thanks, [[email protected]]. You just solved a problem I've been trying to
figure out for a long time...
The */rmeta/text* endpoint is great, but its output contains many line breaks
that are not part of the document body:
{code:java}
"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\nThis is a test\n"
{code}
Any suggestions on removing all those line breaks that do not come from the
document body (and keeping only those that are part of the document body)? I
can't really identify which line breaks at the beginning are part of the body
and which are not :(
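
For illustration only, here is a minimal client-side sketch of the bluntest possible workaround. It assumes that line breaks at the very start and end of the extracted text are never significant, which may not hold for every document. It does not tell body line breaks apart from those introduced during extraction; it simply strips all of them from both ends and leaves interior ones untouched. The function name and the sample string are hypothetical, and the sample is modelled on the output quoted above.

```python
import json

# Hypothetical sample shaped like the /rmeta/text output quoted above:
# a JSON array with one metadata object per document.
sample_response = (
    '[{"X-TIKA:content":"\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nThis is a test\\n"}]'
)


def strip_outer_line_breaks(rmeta_json):
    """Return each entry's X-TIKA:content without leading/trailing line breaks.

    Assumption: line breaks at the start and end of the content are never
    meaningful. Line breaks inside the text are preserved.
    """
    entries = json.loads(rmeta_json)
    # Entries without extracted content are treated as empty strings.
    return [(entry.get("X-TIKA:content") or "").strip("\r\n") for entry in entries]


if __name__ == "__main__":
    print(strip_outer_line_breaks(sample_response))
```

This only post-processes the JSON on the client side, so the original question of which leading line breaks really belong to the document body remains open.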
> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
> Key: TIKA-2755
> URL: https://issues.apache.org/jira/browse/TIKA-2755
> Project: Tika
> Issue Type: Improvement
> Components: server
> Affects Versions: 1.19.1
> Reporter: Harinder
> Priority: Major
> Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the
> alt text of image tags present in HTML files as _[image: this is the alt text
> of the image]_. This ends up in Solr and shows up in the results when we
> generate document summaries at query time (via Solr’s highlight
> functionality).
> If you PUT the attached HTML file to /tika, it returns the following
> response:
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead
> {code:java}
> This is a test {code}