[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316029#comment-16316029 ]
Nick Burch commented on TIKA-2543:
----------------------------------

Based on https://en.wikipedia.org/wiki/Webarchive the underlying format for these is the apple binary plist format. It doesn't look like Commons Compress can handle this for us, unless I've missed that?

Tika Devs - anyone know of a suitably licensed plist library for Java?

[~cleverfoo] Are you able to create a small webarchive file for a simple-ish page we could use for testing? Maybe something like http://tika.apache.org/1.17/configuring.html ?

> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
>                 Key: TIKA-2543
>                 URL: https://issues.apache.org/jira/browse/TIKA-2543
>             Project: Tika
>          Issue Type: Bug
>   Affects Versions: 1.17
>         Environment: MacOS 10.13.2 JDK8
>            Reporter: Rafael Ferreira
>            Priority: Minor
>
> Steps to reproduce:
> # Using safari save any web page as "webarchive"
> # Use tika to extract the archive content like the example below
> Expected result:
> I would expect tika to extract the html contents from the webarchive
> Actual results:
> Nothing is extracted albeit the right mime type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>     TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>     // this looks for content anywhere in the page independently of orientation
>     tesseractOCRConfig.setPageSegMode("11");
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>     try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>     } catch (SAXException e) {
>         throw new EngineError(e);
>     }
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
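
On the binary plist point raised in the comment, here is a rough sketch of how a saved webarchive could be checked for the binary plist header before it is handed to whatever plist library is eventually chosen. It assumes the file begins with the ASCII magic "bplist" (usually followed by the version "00"). The class and method names (BplistSniffer, looksLikeBinaryPlist) are hypothetical and not part of Tika or any existing library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
final class BplistSniffer {

    // Assumption: binary plists start with the ASCII bytes "bplist",
    // followed by a two-character version such as "00".
    private static final byte[] MAGIC = "bplist".getBytes(StandardCharsets.US_ASCII);

    // Reads the first bytes of the stream and compares them with the magic.
    // Returns false if the stream ends before the magic is complete.
    static boolean looksLikeBinaryPlist(InputStream in) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                return false; // stream shorter than the magic
            }
            read += n;
        }
        return Arrays.equals(header, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BplistSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(looksLikeBinaryPlist(in));
        }
    }
}
```

This only identifies the container; it does not decode the plist objects, so extracting the HTML would still need a real plist parser.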