mvolikas opened a new issue, #1650: URL: https://github.com/apache/stormcrawler/issues/1650
### Version

main branch

### Describe what's wrong

Tika's detector (used by [JSoupParserBolt](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java#L500)) seems to treat every file as a potential archive and then fails when the file is not one. Related to https://issues.apache.org/jira/browse/TIKA-4469.

Comment by @rzo1 on the dev list:

> In the end, we should not use Tika's Detector but a TikaInputStream instead like that:
>
>     try (TikaInputStream tis = TikaInputStream.get(data)) {
>         final Metadata metadata = new Metadata();
>         metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, file.getFileName());
>         final MediaType mediaType = MimeTypes.getDefaultMimeTypes().detect(tis, metadata);

### Error message and/or stacktrace

> Exception while guessing mimetype on https://apache.org/: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature

### How to reproduce

Run a crawl with the single seed URL "https://apache.org/".

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@stormcrawler.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
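
The suggestion quoted in the issue relies on detection from the leading bytes and the resource name instead of trying to open the stream as an archive. The following is a minimal, dependency-free sketch of that idea, not Tika's implementation: the class `MimeSniffSketch`, its `detect` method, and the small set of recognised types are all hypothetical and exist only for illustration. The point it tries to show is that unknown content can fall through to a default type rather than raise an exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical, simplified stand-in for signature-plus-name detection; not Tika's code. */
final class MimeSniffSketch {

    // ZIP local file header signature: "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns a best-guess media type and never throws on unknown content. */
    static String detect(byte[] data, String resourceName) {
        if (startsWith(data, ZIP_MAGIC)) {
            return "application/zip";
        }
        // Inspect only a short prefix of the content (512 bytes is an arbitrary choice).
        int prefixLength = Math.min(data.length, 512);
        String head = new String(data, 0, prefixLength, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        // Fall back to the resource name when the bytes are inconclusive.
        if (resourceName != null) {
            String name = resourceName.toLowerCase(Locale.ROOT);
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                return "text/html";
            }
            if (name.endsWith(".zip")) {
                return "application/zip";
            }
        }
        // Unknown content yields a generic type instead of an exception.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] html = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(html, "index.html"));
    }
}
```

In this sketch an HTML page such as the one served at the seed URL never reaches any archive-specific code, which is the behaviour the quoted comment appears to be aiming for.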