[jira] [Comment Edited] (TIKA-2814) Extracted content of EML file contains words like "FONT-SIZE: 9pt; FONT-FAMILY: arial"

Edwin Yeo Zheng Lin (JIRA) Wed, 16 Jan 2019 08:33:36 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744228#comment-16744228
 ]


Edwin Yeo Zheng Lin edited comment on TIKA-2814 at 1/16/19 4:32 PM:
--------------------------------------------------------------------

I believe it is due to this part (from the code shown below) in tika-parser, 
from the file MailContentHandler.java.

The scoring is hard-coded to give text/html a higher score than text/plain. 
This could be the reason why the text/html part is always used when it is 
available.

Should we give text/plain the highest score, or allow the user to choose which 
type gets the higher score?

 

{code:java}
private int score(Part part) {
    if (part == null) {
        return 0;
    }
    if (part instanceof BodyContents) {
        String contentType = ((BodyContents)part).metadata.get(Metadata.CONTENT_TYPE);
        if (contentType == null) {
            return 0;
        } else if (contentType.equalsIgnoreCase(MediaType.TEXT_PLAIN.toString())) {
            return 1;
        } else if (contentType.equalsIgnoreCase("application/rtf")) {
            //TODO -- is this the right definition in rfc822 for rich text?!
            return 2;
        } else if (contentType.equalsIgnoreCase(MediaType.TEXT_HTML.toString())) {
            return 3;
        }
    }
    return 4;
}
{code}
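
As a rough illustration of the second option (letting the user choose), here 
is a minimal sketch. It is not Tika code: the class AlternativeScorer, the 
Preference enum and the pickBest method are hypothetical names, and it works 
on plain content-type strings rather than on Part objects. Unrecognized types 
score 0 here, which is a simplification of the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of a configurable score; not part of Tika. */
final class AlternativeScorer {

    /** Which body type should win when several alternatives are present. */
    enum Preference { PLAIN_TEXT, HTML }

    private AlternativeScorer() {
    }

    /**
     * Scores a content type. With Preference.HTML the ranking matches the
     * existing snippet (plain = 1, rtf = 2, html = 3). With
     * Preference.PLAIN_TEXT the plain and html scores are swapped.
     * Assumption: null and unrecognized types score 0.
     */
    static int score(String contentType, Preference preference) {
        if (contentType == null) {
            return 0;
        }
        boolean preferPlain = preference == Preference.PLAIN_TEXT;
        switch (contentType.toLowerCase(Locale.ROOT)) {
            case "text/plain":
                return preferPlain ? 3 : 1;
            case "application/rtf":
                return 2;
            case "text/html":
                return preferPlain ? 1 : 3;
            default:
                return 0;
        }
    }

    /** Returns the highest-scoring content type, or null for an empty list. */
    static String pickBest(List<String> contentTypes, Preference preference) {
        String best = null;
        int bestScore = -1;
        for (String contentType : contentTypes) {
            int current = score(contentType, preference);
            if (current > bestScore) {
                bestScore = current;
                best = contentType;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> alternatives = Arrays.asList("text/plain", "text/html");
        // Current behaviour: the HTML part wins.
        System.out.println(pickBest(alternatives, Preference.HTML));
        // Proposed option: the plain-text part wins.
        System.out.println(pickBest(alternatives, Preference.PLAIN_TEXT));
    }
}
```

With a switch like this, the default could stay as it is today (HTML 
preferred), so existing users would see no change unless they opt in to 
plain text.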


was (Author: edwinyeozl):
I believe it is due to this part (from the code shown below) in tika-parser, 
from the file MailContentHandler.java.

The scoring is hard-coded to give text/html a higher score than text/plain. 
This could be the reason why the text/html part is always used when it is 
available.

Should we give text/plain the highest score, or allow the user to choose which 
type gets the higher score?

 

{code:java}
private int score(Part part) {
    if (part == null) {
        return 0;
    }
    if (part instanceof BodyContents) {
        String contentType = ((BodyContents)part).metadata.get(Metadata.CONTENT_TYPE);
        if (contentType == null) {
            return 0;
        } else if (contentType.equalsIgnoreCase(MediaType.TEXT_PLAIN.toString())) {
            return 1;
        } else if (contentType.equalsIgnoreCase("application/rtf")) {
            //TODO -- is this the right definition in rfc822 for rich text?!
            return 2;
        } else if (contentType.equalsIgnoreCase(MediaType.TEXT_HTML.toString())) {
            return 3;
        }
    }
    return 4;
}
{code}

> Extracted content of EML file contains words like "FONT-SIZE: 9pt; 
> FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
>                 Key: TIKA-2814
>                 URL: https://issues.apache.org/jira/browse/TIKA-2814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17, 1.18
>         Environment: Source code in MailContentHandler.java, 
> handleInlineBodyPart() function
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: eml, extraction, parser
>
> When we are indexing an EML file, Tika gives priority to the text/html part. 
> However, the extracted content contains a lot of words like "*FONT-SIZE: 9pt; 
> FONT-FAMILY: arial*", and none of these are removed by Tika, which makes the 
> content very cluttered and unreadable.
>  
>  This is what is output in the content after being extracted by Tika:
> { "content":" font-size: 14pt; font-family: book antiqua, palatino, 
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; My client owns the domain name “ font-size: 14pt; color: 
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com 
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is 
> considering putting it in market. It is keyword rich domain with good search 
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Based on our extensive study, we 
> strongly feel that you should consider buying this domain name to improve the 
> SEO, Online visibility, brand image, authority and type-in-traffic for your 
> business. We also do provide free 1 year hosting and unlimited emails along 
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua, 
> palatino, serif; Besides this, if you need any other domain name, web and app 
> designing services and digital marketing services (SEO, PPC and SMO) at 
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt; 
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size: 
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}
>  
> In the MailContentHandler.java code, the handleInlineBodyPart() function 
> uses HtmlParser.class for MediaType.TEXT_HTML. However, this parser does not 
> remove text such as "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", so all of it 
> ends up in the extracted content. We should fix HtmlParser so that it removes 
> this style text and the content is readable after extraction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
