[ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653715#comment-16653715 ]
Rafael Ferreira commented on TIKA-2543:
---------------------------------------

If someone can point me to the general area of the problem, I'm happy to try to get a PR out myself. Could it be a MIME identification issue causing the correct parser not to be called?

> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
>                 Key: TIKA-2543
>                 URL: https://issues.apache.org/jira/browse/TIKA-2543
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>         Environment: MacOS 10.13.2 JDK8
>            Reporter: Rafael Ferreira
>            Priority: Minor
>         Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce:
> # Using Safari, save any web page as "webarchive"
> # Use Tika to extract the archive content like the example below
> Expected result:
> I would expect Tika to extract the HTML contents from the webarchive
> Actual results:
> Nothing is extracted, although the right MIME type is identified.
> {code:java}
> try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>     TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>     // this looks for content anywhere in the page independently of orientation
>     tesseractOCRConfig.setPageSegMode("11");
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>     try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>     } catch (SAXException e) {
>         throw new EngineError(e);
>     }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)