[ 
https://issues.apache.org/jira/browse/SOLR-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336563#comment-15336563
 ] 

ASF GitHub Bot commented on SOLR-8981:
--------------------------------------

Github user tballison commented on the issue:

    https://github.com/apache/lucene-solr/pull/44
  
    Tika's XHTMLContentHandler adds its own <body> and </body>.  In 
out-of-the-box Tika with the DefaultHtmlMapper, "body" is not in the list of 
"SAFE_ELEMENTS", so the source HTML's own "body" tag is never passed through, 
and we don't see the doubling in Tika.  Solr's mapper passes the tag through, 
so it ends up emitted twice.
    
    The solution is to suppress the body tag in Solr's 
MostlyPassthroughHtmlMapper.
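
    A rough sketch of what that suppression could look like, not the actual 
patch.  To stay self-contained it does not depend on Tika: SimpleHtmlMapper 
is a hypothetical stand-in that mirrors the three methods of Tika's HtmlMapper 
interface, and the class names are made up for illustration.  It assumes that 
returning null from mapSafeElement means "do not emit this tag, but keep 
processing its content".

```java
import java.util.Locale;

// Illustration only: a stand-alone sketch, not Solr's real mapper.
class BodySuppressingMapperSketch {

    /** Hypothetical stand-in mirroring the methods of Tika's HtmlMapper. */
    interface SimpleHtmlMapper {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    /** Passes every element through except "body". */
    static class MostlyPassthroughSketch implements SimpleHtmlMapper {

        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            // Assumption: null means the tag itself is not emitted. The
            // content handler already writes its own <body>, so passing the
            // source document's <body> through would double it.
            return lower.equals("body") ? null : lower;
        }

        @Override
        public boolean isDiscardElement(String name) {
            // Never drop an element's content; only the body tag is hidden.
            return false;
        }

        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            // Keep every attribute, normalized to lower case.
            return attributeName.toLowerCase(Locale.ROOT);
        }
    }
}
```

    With a mapper like this, mapSafeElement("BODY") returns null while 
mapSafeElement("DIV") returns "div", so only the handler's own body tags 
should remain in the output.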


> Upgrade to Tika 1.13 when it is available
> -----------------------------------------
>
>                 Key: SOLR-8981
>                 URL: https://issues.apache.org/jira/browse/SOLR-8981
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Uwe Schindler
>            Priority: Minor
>
> Tika 1.13 should be out within a month.  This includes PDFBox 2.0.0 and a 
> number of other upgrades and improvements.  
> If there are any showstoppers in 1.13 from Solr's side or requests before we 
> roll 1.13, let us know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
