[
https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744228#comment-16744228
]
Edwin Yeo Zheng Lin edited comment on TIKA-2814 at 1/16/19 4:32 PM:
--------------------------------------------------------------------
I believe this is caused by the score() method in MailContentHandler.java
(tika-parsers), shown below.
The method gives text/html a higher score than text/plain, which would explain
why the text/html part is always used when one is available.
Should we give text/plain the highest score, or let the user choose which type
gets the higher score?
{{ private int score(Part part) {}}
{{     if (part == null) {}}
{{         return 0;}}
{{     }}}
{{     if (part instanceof BodyContents) {}}
{{         String contentType = ((BodyContents)part).metadata.get(Metadata.CONTENT_TYPE);}}
{{         if (contentType == null) {}}
{{             return 0;}}
{{         } else if (contentType.equalsIgnoreCase(MediaType.TEXT_PLAIN.toString())) {}}
{{             return 1;}}
{{         } else if (contentType.equalsIgnoreCase("application/rtf")) {}}
{{             //TODO -- is this the right definition in rfc822 for rich text?!}}
{{             return 2;}}
{{         } else if (contentType.equalsIgnoreCase(MediaType.TEXT_HTML.toString())) {}}
{{             return 3;}}
{{         }}}
{{     }}}
{{     return 4;}}
{{ }}}
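
To make the "let the user choose" option concrete, here is a minimal sketch of a scorer whose preference order is supplied by the caller. It is only an illustration: `ConfigurableBodyScorer` and its constructor argument are hypothetical names, not part of Tika, and it works on the content-type string alone so that it runs without any Tika classes. It also assumes that parameters such as `; charset=UTF-8` should be ignored, and unlike the score() method above it gives unlisted types a score of 0 rather than 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper (not part of Tika): ranks a body part's content type
 * by a caller-supplied preference order instead of a hard-coded one.
 */
class ConfigurableBodyScorer {
    // Content types ordered from least preferred to most preferred.
    private final List<String> preference = new ArrayList<>();

    ConfigurableBodyScorer(String... leastToMostPreferred) {
        for (String type : leastToMostPreferred) {
            preference.add(normalize(type));
        }
    }

    /**
     * Returns 0 for a null or unlisted content type, otherwise the 1-based
     * position of the type in the preference order (higher wins).
     */
    int score(String contentType) {
        if (contentType == null) {
            return 0;
        }
        return preference.indexOf(normalize(contentType)) + 1;
    }

    // Assumption: drop parameters such as "; charset=UTF-8" and compare case-insensitively.
    private static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With `new ConfigurableBodyScorer("text/html", "application/rtf", "text/plain")`, the text/plain part would outrank text/html; reversing the arguments would reproduce the current ordering.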
was (Author: edwinyeozl):
I believe this is caused by the score() method in MailContentHandler.java
(tika-parsers), shown below.
The method gives text/html a higher score than text/plain, which would explain
why the text/html part is always used when one is available.
Should we give text/plain the highest score, or let the user choose which type
gets the higher score?
{{ private int score(Part part) {}}
{{ if (part == null) {}}
{{ return 0;}}
{{ }}}
{{ if (part instanceof BodyContents) {}}
{{ String contentType = ((BodyContents)part).metadata.get(Metadata.CONTENT_TYPE);}}
{{ if (contentType == null) {}}
{{ return 0;}}
{{ } else if (contentType.equalsIgnoreCase(MediaType.TEXT_PLAIN.toString())) {}}
{{ return 1;}}
{{ } else if (contentType.equalsIgnoreCase("application/rtf")) {}}
{{ //TODO -- is this the right definition in rfc822 for rich text?!}}
{{ return 2;}}
{{ } else if (contentType.equalsIgnoreCase(MediaType.TEXT_HTML.toString())) {}}
{{ return 3;}}
{{ }}}
{{ }}}
{{ return 4;}}
{{ }}}
> Extracted content of EML file contains words like "FONT-SIZE: 9pt;
> FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
> Key: TIKA-2814
> URL: https://issues.apache.org/jira/browse/TIKA-2814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17, 1.18
> Environment: Source code in MailContentHandler.java,
> handleInlineBodyPart() function
> Reporter: Edwin Yeo Zheng Lin
> Priority: Major
> Labels: eml, extraction, parser
>
> When indexing an EML file, Tika gives priority to the text/html part.
> However, the extracted content contains a lot of strings such as
> "*FONT-SIZE: 9pt; FONT-FAMILY: arial*". None of these are removed by Tika,
> which makes the content cluttered and unreadable.
>
> This is what is output in the content after being extracted by Tika:
> {{ "content":" font-size: 14pt; font-family: book antiqua, palatino,
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; My client owns the domain name “ font-size: 14pt; color:
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is
> considering putting it in market. It is keyword rich domain with good search
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Based on our extensive study, we
> strongly feel that you should consider buying this domain name to improve the
> SEO, Online visibility, brand image, authority and type-in-traffic for your
> business. We also do provide free 1 year hosting and unlimited emails along
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; Besides this, if you need any other domain name, web and app
> designing services and digital marketing services (SEO, PPC and SMO) at
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size:
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}}
>
> In MailContentHandler.java, handleInlineBodyPart() uses HtmlParser.class for
> MediaType.TEXT_HTML. However, this parser does not remove strings such as
> "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", so they all end up in the output
> content. We should fix HtmlParser so that it strips these style declarations
> and the extracted content is readable.
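
One way to check for this symptom in extracted text is a small heuristic that counts leaked inline-style declarations. The sketch below is only an illustration: `CssResidueCheck` and `countCssDeclarations` are hypothetical names, not part of Tika, and the property list (font-size, font-family, color) is an assumption taken from the sample output above. It can also flag ordinary prose that happens to look like a declaration, so it is a diagnostic aid rather than a fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical diagnostic (not part of Tika): counts CSS-like declarations in extracted text. */
class CssResidueCheck {
    // Matches declarations such as "font-size: 14pt;" or "FONT-FAMILY: arial;".
    // The value is capped at 80 characters so one match cannot swallow a whole paragraph.
    private static final Pattern CSS_DECLARATION = Pattern.compile(
            "\\b(?:font-size|font-family|color)\\s*:\\s*[^;]{1,80};",
            Pattern.CASE_INSENSITIVE);

    /** Returns the number of semicolon-terminated style declarations found in the text. */
    static int countCssDeclarations(String extractedText) {
        if (extractedText == null) {
            return 0;
        }
        Matcher matcher = CSS_DECLARATION.matcher(extractedText);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

A result greater than zero on the extracted content of an EML file suggests that style attributes were emitted as text; clean output such as "Hi There, Best Regards, Josh" should give zero.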