[ https://issues.apache.org/jira/browse/TIKA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743315#comment-16743315 ]
Tim Allison edited comment on TIKA-2814 at 1/15/19 7:19 PM:
------------------------------------------------------------
Are you able to share an example input file?
was (Author: [email protected]):
Are you able to share an example file?
> Extracted content of EML file contains words like "FONT-SIZE: 9pt;
> FONT-FAMILY: arial"
> --------------------------------------------------------------------------------------
>
> Key: TIKA-2814
> URL: https://issues.apache.org/jira/browse/TIKA-2814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17, 1.18
> Environment: Source code in MailContentHandler.java,
> handleInlineBodyPart() function
> Reporter: Edwin Yeo Zheng Lin
> Priority: Major
> Labels: eml, extraction, parser
>
> When we index an EML file, Tika gives priority to the text/html part.
> However, the extracted content contains a lot of strings like "*FONT-SIZE: 9pt;
> FONT-FAMILY: arial*". Tika does not remove any of them, which makes the
> content very cluttered and unreadable.
>
> This is what is output in the content after being extracted by Tika:
> {{ \{{ "content":" font-size: 14pt; font-family: book antiqua, palatino,
> serif; Hi There, <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; My client owns the domain name “ font-size: 14pt; color:
> #0000ff; font-family: arial black, sans-serif; TravelInsuranceEurope.com
> font-size: 14pt; font-family: book antiqua, palatino, serif; ” and is
> considering putting it in market. It is keyword rich domain with good search
> volume,adword bidding and type-in-traffic. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Based on our extensive study, we
> strongly feel that you should consider buying this domain name to improve the
> SEO, Online visibility, brand image, authority and type-in-traffic for your
> business. We also do provide free 1 year hosting and unlimited emails along
> with domain name. <br><br> font-size: 14pt; font-family: book antiqua,
> palatino, serif; Besides this, if you need any other domain name, web and app
> designing services and digital marketing services (SEO, PPC and SMO) at
> reasonable charges, feel free to contact us. <br><br> font-size: 14pt;
> font-family: book antiqua, palatino, serif; Best Regards, <br><br> font-size:
> 14pt; font-family: book antiqua, palatino, serif; Josh <br><br>"}}}}
>
> In MailContentHandler.java, handleInlineBodyPart() parses
> MediaType.TEXT_HTML parts with HtmlParser.class. However, this parser does
> not remove style declarations such as "*FONT-SIZE: 9pt; FONT-FAMILY: arial*",
> so they all end up in the extracted content. We should fix HtmlParser so that
> it removes these declarations and the content is readable after extraction.
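
Until the parser itself is fixed, one possible stopgap is to filter the
extracted text afterwards. The following is only a sketch under assumptions:
it is not part of Tika's API and not the fix proposed in this issue. The class
StyleResidueFilter and its strip() method are hypothetical names, and the
pattern only covers the three properties seen in the sample output above
(font-size, font-family, color). It would also delete legitimate text that
happens to look like one of those declarations.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of Tika.
class StyleResidueFilter {

    // Matches one leaked CSS declaration such as "font-size: 14pt; ".
    // Assumption: each declaration ends with ';' and uses one of the
    // three property names that appear in the reported output.
    private static final Pattern CSS_DECLARATION = Pattern.compile(
            "(?i)\\b(?:font-size|font-family|color)\\s*:\\s*[^;\"<>]*;\\s*");

    // Returns the text with every matching declaration removed.
    static String strip(String extracted) {
        return CSS_DECLARATION.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "font-size: 14pt; font-family: book antiqua, "
                + "palatino, serif; Hi There, <br><br>";
        // Prints: Hi There, <br><br>
        System.out.println(strip(sample));
    }
}
```

A declaration without a trailing semicolon is left untouched, so this only
reduces the clutter; the real fix still belongs in the HTML parsing step.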