Rafael Ferreira created TIKA-2543:
-------------------------------------
Summary: No content extraction for application/x-webarchive format
Key: TIKA-2543
URL: https://issues.apache.org/jira/browse/TIKA-2543
Project: Tika
Issue Type: Bug
Affects Versions: 1.17
Environment: macOS 10.13.2, JDK 8
Reporter: Rafael Ferreira
Priority: Minor
Steps to reproduce:
# In Safari, save any web page as a "webarchive".
# Use Tika to extract the archive content, as in the example below.
Expected result:
I would expect Tika to extract the HTML content from the webarchive.
Actual results:
Nothing is extracted, although the correct MIME type is identified.
{code:java}
try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
    TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
    // this looks for content anywhere in the page independently of orientation
    tesseractOCRConfig.setPageSegMode("11");

    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    context.set(TesseractOCRConfig.class, tesseractOCRConfig);

    // Parse the webarchive and stream any extracted text to the writer
    try (InputStream fd = Files.newInputStream(path)) {
        tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
    } catch (SAXException e) {
        throw new EngineError(e);
    }
}
{code}
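When reproducing this, it may help to confirm first that the file Safari saved really is a webarchive. The sketch below assumes that Safari writes webarchives as binary property lists, which begin with the ASCII magic bytes `bplist00`. The class `WebArchiveCheck` and its methods are hypothetical helpers for illustration, not part of Tika, and a match only shows that the file is a binary plist, not that it is specifically a webarchive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Tika): checks for the binary plist magic bytes.
final class WebArchiveCheck {
    // Assumption: Safari webarchives are binary property lists starting with "bplist00".
    private static final byte[] BPLIST_MAGIC =
            "bplist00".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes start with the binary plist magic.
    static boolean looksLikeBinaryPlist(byte[] header) {
        if (header.length < BPLIST_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, BPLIST_MAGIC.length), BPLIST_MAGIC);
    }

    // Reads at most the first 8 bytes of the file and checks them.
    static boolean looksLikeBinaryPlist(Path file) throws IOException {
        byte[] header = new byte[BPLIST_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
        }
        return looksLikeBinaryPlist(Arrays.copyOf(header, total));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WebArchiveCheck <file>");
            return;
        }
        System.out.println(looksLikeBinaryPlist(Paths.get(args[0])));
    }
}
```

If the check passes and Tika still writes no content, the problem is more likely in parsing than in the saved file.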
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)