Mariusz Cieślukowski created TIKA-3100:
------------------------------------------
Summary: RFC822Parser ignores charset when extractAllAlternatives
is set to true
Key: TIKA-3100
URL: https://issues.apache.org/jira/browse/TIKA-3100
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.24.1
Environment:
Windows 10 x64
OpenJDK 14
Reporter: Mariusz Cieślukowski
Attachments: testRFC822_quoted_charset_iso_8859_2
In default mode, RFC822Parser seems to ignore the charset defined in the
headers when detecting content. When I set "extractAllAlternatives" to false,
the content seems fine.
Test case:
{code:java}
@Test
public void testQuotedPrintableCharset() {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    ParseContext context = new ParseContext();
    // try-with-resources closes the test document stream when the test ends
    try (InputStream stream =
            getStream("test-documents/testRFC822_quoted_charset_iso_8859_2")) {
        RFC822Parser emailparser = new RFC822Parser();
        // Reproduce the bug: the charset is lost only when this flag is true
        emailparser.setExtractAllAlternatives(true);
        emailparser.parse(stream, handler, metadata, context);
        String bodyText = handler.toString();
        // Expect "Dzie\u0144 dobry." decoded with the charset from the headers
        assertTrue(bodyText.contains("Dzie\u0144 dobry."));
    } catch (Exception e) {
        fail("Exception thrown: " + e.getMessage());
    }
}
{code}
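For illustration only, the symptom can be reproduced outside Tika with plain JDK classes. The sketch below encodes the expected text as ISO-8859-2 and decodes the same bytes twice: once with the declared charset and once with a different one. The class name CharsetMismatchDemo and the choice of ISO-8859-1 as the "wrong" charset are assumptions made for the example; the report does not say which charset the parser actually falls back to. It also assumes the JDK ships the ISO-8859-2 charset, which is not one of the charsets Java guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decode the same raw bytes with a caller-chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset declared = Charset.forName("ISO-8859-2");

        // In ISO-8859-2 the letter U+0144 is the single byte 0xF1.
        byte[] raw = "Dzie\u0144 dobry.".getBytes(declared);

        // Charset from the headers is honoured: the text round-trips.
        String honoured = decode(raw, declared);
        // Charset is ignored (ISO-8859-1 is an illustrative stand-in):
        // byte 0xF1 now decodes to U+00F1 instead of U+0144.
        String ignored = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(honoured.contains("Dzie\u0144 dobry.")); // true
        System.out.println(ignored.contains("Dzie\u0144 dobry."));  // false
    }
}
```

The second check fails in the same way as the assertTrue in the test case above, which is consistent with the body part being decoded without the charset from its Content-Type header.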