[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316029#comment-16316029 ]
Nick Burch commented on TIKA-2543:
----------------------------------
Based on https://en.wikipedia.org/wiki/Webarchive, the underlying format for
these files is Apple's binary plist format. It doesn't look like Commons
Compress can handle this for us, unless I've missed something?
Tika devs: does anyone know of a suitably licensed plist library for Java?
[~cleverfoo] Are you able to create a small webarchive file for a simple-ish
page we could use for testing? Maybe something like
http://tika.apache.org/1.17/configuring.html ?
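For reference, a binary plist can at least be recognised by its leading magic
bytes ("bplist" followed by a two-character version, typically "00"). The
sketch below is a minimal JDK-only check, not a parser; the class and method
names (BplistSniffer, looksLikeBinaryPlist) are made up for illustration and
are not Tika or Commons Compress APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
final class BplistSniffer {
    // Binary property lists start with the ASCII bytes "bplist",
    // followed by a two-character version such as "00".
    private static final byte[] MAGIC =
            "bplist".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream begins with the binary plist magic bytes. */
    static boolean looksLikeBinaryPlist(InputStream in) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int off = 0;
        // read() may return fewer bytes than requested, so loop until full.
        while (off < header.length) {
            int n = in.read(header, off, header.length - off);
            if (n < 0) {
                return false; // stream ended before the magic could be read
            }
            off += n;
        }
        return Arrays.equals(header, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] binary = "bplist00....".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeBinaryPlist(new ByteArrayInputStream(binary))); // true
        System.out.println(looksLikeBinaryPlist(new ByteArrayInputStream(xml)));    // false
    }
}
```

This only tells us a file is a binary plist; pulling the HTML out of a
webarchive would still need a real plist parser.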
> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
> Key: TIKA-2543
> URL: https://issues.apache.org/jira/browse/TIKA-2543
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Environment: MacOS 10.13.2 JDK8
> Reporter: Rafael Ferreira
> Priority: Minor
>
> Steps to reproduce:
> # Using Safari, save any web page as a "webarchive"
> # Use Tika to extract the archive content, as in the example below
> Expected result:
> I would expect Tika to extract the HTML contents from the webarchive.
> Actual results:
> Nothing is extracted, although the correct MIME type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath,
>         Charsets.UTF_8)) {
>     TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>     // this looks for content anywhere in the page independently of orientation
>     tesseractOCRConfig.setPageSegMode("11");
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>     try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer),
>                 new Metadata(), context);
>     } catch (SAXException e) {
>         throw new EngineError(e);
>     }
> }
> {code}