[jira] [Resolved] (TIKA-2814) Extracted content of EML file contains words like "FONT-SIZE: 9pt; FONT-FAMILY: arial"

Tim Allison (JIRA) Thu, 17 Jan 2019 08:29:11 -0800


     [ 
https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison resolved TIKA-2814.
-------------------------------
    Resolution: Not A Bug

This is the expected behavior in Solr's ExtractingRequestHandler.

See SolrContentHandler (note that {{captureAttribs}} is "false" by default, 
which means attribute values get written to the content):
{noformat}
  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    StringBuilder theBldr = fieldBuilders.get(localName);
    if (theBldr != null) {
      //we need to switch the currentBuilder
      bldrStack.add(theBldr);
    }
    if (captureAttribs == true) {
      for (int i = 0; i < attributes.getLength(); i++) {
        addField(localName, attributes.getValue(i), null);
      }
    } else {
      for (int i = 0; i < attributes.getLength(); i++) {
        bldrStack.getLast().append(' ').append(attributes.getValue(i));
      }
    }
    bldrStack.getLast().append(' ');
  }

{noformat}
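
To make the effect of the two branches easier to see in isolation, here is a minimal sketch using only the JDK's SAX parser. It is not Solr code: the names {{CaptureAttrSketch}}, {{SketchHandler}}, {{attrFields}} and {{extract}} are hypothetical, and the handler only imitates the attribute handling quoted above (no field mapping, no builder stack). The input is a small well-formed snippet standing in for the XHTML that Tika emits.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CaptureAttrSketch {

    /** Hypothetical stand-in for SolrContentHandler; it only mimics the attribute branches. */
    static class SketchHandler extends DefaultHandler {
        final boolean captureAttribs;
        final StringBuilder content = new StringBuilder();
        final List<String> attrFields = new ArrayList<>();

        SketchHandler(boolean captureAttribs) {
            this.captureAttribs = captureAttribs;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            for (int i = 0; i < attributes.getLength(); i++) {
                if (captureAttribs) {
                    // Attribute values are routed to a separate "field", keyed by element name.
                    attrFields.add(qName + "=" + attributes.getValue(i));
                } else {
                    // Attribute values are appended to the running text content.
                    content.append(' ').append(attributes.getValue(i));
                }
            }
            content.append(' ');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            content.append(ch, start, length);
        }
    }

    /** Parses a well-formed snippet with the sketch handler and returns the handler. */
    static SketchHandler extract(String xhtml, boolean captureAttribs) throws Exception {
        SketchHandler handler = new SketchHandler(captureAttribs);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<p style=\"font-size: 14pt\">Hi There,</p>";
        // With capture disabled, the style value is mixed into the content.
        System.out.println("[" + extract(xhtml, false).content + "]");
        // With capture enabled, the content is clean and the style value is kept separately.
        SketchHandler on = extract(xhtml, true);
        System.out.println("[" + on.content + "] " + on.attrFields);
    }
}
```

With capture disabled, the {{style}} value shows up inside the content string, which matches the symptom in the report; with capture enabled, it is kept out of the content.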

If you set {{captureAttr}} to {{true}}, you should get the behavior you 
want:

{noformat}
 "id":"three",
        "subject":["TravelInsuranceEurope.com"],
        "stream_name":["testEML_contentHtml.eml"],
        "Content-Type":["message/rfc822"],
        "stream_size":["6759"],
        "extractedCreator":["Edwin Yeo <[email protected]>"],
        "stream_source_info":["file:/C:/Users/tallison/Idea%20Projects/my-lucene-solr-fork/idea-build/solr/contrib/solr-cell/classes/test/extraction/testEML_contentHtml.eml"],
        "extractedDate":["2018-12-18T11:13:28Z"],
        "extractedAuthor":["Edwin Yeo <[email protected]>"],
        "extractedContent":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n 
 \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n 
 \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n 
TravelInsuranceEurope.com \n \n    Hi There,   \n\n   My client owns the domain 
name “    TravelInsuranceEurope.com    ” and is considering putting it in 
market. It is keyword rich domain with good search volume,adword bidding and 
type-in-traffic.   \n\n   Based on our extensive study, we strongly feel that 
you should consider buying this domain name to improve the SEO, Online 
visibility, brand image, authority and type-in-traffic for your business. We 
also do provide free 1 year hosting and unlimited emails along with domain 
name.   \n\n   Besides this, if you need any other domain name, web and app 
designing services and digital marketing services (SEO, PPC and SMO) at 
reasonable charges, feel free to contact us.   \n\n   Best Regards,   \n\n   
Josh   \n\n  "],
        "multiDefault":["muLti-Default"],
        "intDefault":42,
        "timestamp":"2019-01-17T16:24:25.032Z"}]
  }}
{noformat}

from this (non)test:
{noformat}
  @Test
  public void testEML() throws Exception {
    loadLocal("extraction/testEML_contentHtml.eml",
        "fmap.created", "extractedDate",
        "fmap.producer", "extractedProducer",
        "fmap.creator", "extractedCreator",
        "fmap.Keywords", "extractedKeywords",
        "fmap.Author", "extractedAuthor",
        "literal.id", "three",
        "fmap.content", "extractedContent",
        "fmap.language", "extractedLanguage",
        "fmap.Creation-Date", "extractedDate",
        "uprefix", "ignored_",
        "fmap.Last-Modified", "extractedDate",
        "captureAttr", "true"
        );
    assertU(commit());
    System.out.println(JQ(req("*:*")));
  }
{noformat}

> Extracted content of EML file contains words like "FONT-SIZE: 9pt; 
> FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-2814
>                 URL: https://issues.apache.org/jira/browse/TIKA-2814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17, 1.18
>         Environment: Source code in MailContentHandler.java, 
> handleInlineBodyPart() function
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: eml, extraction, parser
>
> When we index an EML file, Tika gives priority to the text/html part. 
> However, the extracted content contains a lot of strings like "*FONT-SIZE: 
> 9pt; FONT-FAMILY: arial*", and none of these are removed by Tika, which 
> makes the content very cluttered and unreadable.
>  
>  This is what is output in the content after being extracted by Tika:
> {{ \{{ "content":" font-size: 14pt; font-family: book antiqua, palatino, 
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; My client owns the domain name “ font-size: 14pt; color: 
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com 
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is 
> considering putting it in market. It is keyword rich domain with good search 
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Based on our extensive study, we 
> strongly feel that you should consider buying this domain name to improve the 
> SEO, Online visibility, brand image, authority and type-in-traffic for your 
> business. We also do provide free 1 year hosting and unlimited emails along 
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; Besides this, if you need any other domain name, web and app 
> designing services and digital marketing services (SEO, PPC and SMO) at 
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size: 
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}}}}
>  
> In the MailContentHandler.java code, under the function 
> handleInlineBodyPart(), for MediaType.TEXT_HTML, it is using the 
> HtmlParser.class. However, this parser is not doing the job of removing 
> "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", and all of these get output to the 
> content. We should resolve the issue with this HtmlParser so that it is able 
> to remove those tags and make the content readable after extraction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
