[I] Tika's detector breaks mime type guessing [stormcrawler]

via GitHub Sun, 07 Sep 2025 07:20:43 -0700


mvolikas opened a new issue, #1650:
URL: https://github.com/apache/stormcrawler/issues/1650


   ### Version
   
   main branch
   
   ### Describe what's wrong
   
   Tika's detector (used by [JSoupParserBolt](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java#L500)) seems to treat every file as a potential archive and then fails because the file is not actually an archive.
   Related to https://issues.apache.org/jira/browse/TIKA-4469.
   Comment by @rzo1 on the dev list:
   
   > In the end, we should not use Tika's Detector but a TikaInputStream instead like that:
   >
   >     try (TikaInputStream tis = TikaInputStream.get(data)) {
   >         final Metadata metadata = new Metadata();
   >         metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, file.getFileName());
   >         final MediaType mediaType = MimeTypes.getDefaultMimeTypes().detect(tis, metadata);
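
The suggestion quoted above could be sketched as a self-contained class along the following lines. This is only an illustration, assuming `tika-core` 2.x or later is on the classpath; the class name `MimeDetectSketch`, the helper `detect`, and the sample HTML bytes are hypothetical and are not taken from the StormCrawler code base or from any actual patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;

// Hypothetical helper, not part of StormCrawler.
public class MimeDetectSketch {

    /**
     * Guesses the media type of the given bytes using only Tika's
     * magic-byte and file-name based MimeTypes registry, without going
     * through the composite default detector.
     *
     * @param data         raw content, for example a fetched page
     * @param resourceName optional file name or URL path used as a hint; may be null
     */
    static String detect(byte[] data, String resourceName) {
        try (TikaInputStream tis = TikaInputStream.get(data)) {
            Metadata metadata = new Metadata();
            if (resourceName != null) {
                // The resource name lets Tika refine the guess via glob patterns.
                metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, resourceName);
            }
            MediaType mediaType = MimeTypes.getDefaultMimeTypes().detect(tis, metadata);
            return mediaType.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample input standing in for a fetched HTML page (assumption).
        byte[] html = "<!DOCTYPE html><html><head><title>t</title></head><body></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(html, "index.html"));
    }
}
```

Because `MimeTypes` only inspects magic bytes and the resource name, this path should not involve the archive detection that raises the `ArchiveException` reported below, although that has not been verified against the crawler itself.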
   
   ### Error message and/or stacktrace
   
   > Exception while guessing mimetype on https://apache.org/: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
   
   ### How to reproduce
   
   Run a crawl with the single seed URL "https://apache.org/".
   
   ### Additional context
   
   _No response_


