The current algorithm for `has_html_view` looks something like this:

1. Guess the mimetype from the filename only (`mimetypes.guess_type`).
2. If it is `text/*`, assume we can display it.
3. If not, check the various lists of "viewable extensions"; if one of them 
contains the extension of the given filename, assume we can display it.
4. If not, guess the file type from the content (using `python-magic`); if it 
is text, assume we can display it.
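
The steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the actual Allura code: `VIEWABLE_EXTENSIONS`, `looks_like_text` and the `read_content` callback are hypothetical names, and a simple NUL-byte test stands in for the `python-magic` call so the sketch runs with the standard library alone.

```python
import mimetypes
import os

# Hypothetical stand-in for the various "viewable extensions" lists.
VIEWABLE_EXTENSIONS = frozenset(['.md', '.ini', '.cfg'])


def looks_like_text(content, sample_size=1024):
    """Stand-in for the content-based check (step 4).

    The real code uses python-magic; here we only assume that text
    has no NUL bytes in its first `sample_size` bytes.
    """
    return b'\0' not in content[:sample_size]


def has_html_view(filename, read_content):
    """Return True if we assume the file can be displayed.

    `read_content` is a callable returning the file content as bytes;
    it is only invoked if the cheap name-based checks fail.
    """
    # Steps 1 and 2: guess from the filename only.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type and mime_type.startswith('text/'):
        return True
    # Step 3: known viewable extensions.
    ext = os.path.splitext(filename)[1].lower()
    if ext in VIEWABLE_EXTENSIONS:
        return True
    # Step 4: fall back to the (expensive) content check.
    return looks_like_text(read_content())
```

Note that the content is never read when step 2 or step 3 succeeds, which is what keeps the common case fast and also what lets a misnamed binary slip through.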

The problem with the example from the ticket is that ".d" is a valid extension 
for D programming language source files, so in step 1 the file is detected as 
such (mimetype `text/x-dsrc`) and step 2 accepts it without ever looking at the 
content.

One way to fix this is to always check the content of the file whenever we 
think it is a text file that we can display, but that would slow things down 
significantly. On my local machine the content-based check is about 184 times 
slower in the best-case scenario (120ms vs. 0.65ms), and we would pay that cost 
for every viewable file (which is most files in a typical repo).
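
A comparison of this kind can be reproduced with something like the sketch below. It is not the benchmark I actually ran: `check_by_name`, `check_by_content` and `compare` are hypothetical helpers, the content check reads a plain file from disk instead of a repository blob, and a NUL-byte test stands in for `python-magic`, so the absolute numbers will differ.

```python
import mimetypes
import timeit


def check_by_name(filename):
    """Cheap check: look at the filename only."""
    mime_type, _ = mimetypes.guess_type(filename)
    return bool(mime_type and mime_type.startswith('text/'))


def check_by_content(path, sample_size=4096):
    """Expensive check: open the file and inspect its first bytes.

    The NUL-byte test is an assumption standing in for python-magic.
    """
    with open(path, 'rb') as f:
        return b'\0' not in f.read(sample_size)


def compare(path, number=1000):
    """Return average seconds per call as (by_name, by_content)."""
    by_name = timeit.timeit(lambda: check_by_name(path), number=number)
    by_content = timeit.timeit(lambda: check_by_content(path), number=number)
    return by_name / number, by_content / number
```

Calling `compare('/path/to/some/file')` returns the two per-call averages, so the ratio between them gives the slowdown factor for that file.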

I have checked the `binaryornot` and `filemagic` libraries, but they run at 
about the same speed as `python-magic`, which we're using now. Most of the time 
is probably spent accessing the filesystem, not actually guessing the file's 
type.

I can't think of any way to exclude false positives without a performance 
penalty. Any thoughts?

To reduce the performance penalty we could check file content only for files 
with more than one dot in the filename (like `2.jpg.d`), but that seems like a 
very poor heuristic to me.
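
For reference, the dot-counting heuristic would be something like this minimal sketch (`needs_content_check` is a hypothetical name, and paths are assumed to use "/" as in repository trees):

```python
import posixpath


def needs_content_check(filename):
    """Return True if the name has more than one dot, e.g. `2.jpg.d`.

    Only the last path component is considered, so dots in directory
    names do not count.
    """
    basename = posixpath.basename(filename)
    return basename.count('.') > 1
```

It flags `2.jpg.d`, but it also flags harmless names like `jquery.min.js`, and it misses a binary that was simply named `photo.d`, which is why I don't trust it.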


---

**[tickets:#7962] Better binary file detection**

**Status:** in-progress
**Milestone:** unreleased
**Labels:** 42cc 
**Created:** Wed Aug 12, 2015 03:25 PM UTC by Heith Seewald
**Last Updated:** Tue Aug 18, 2015 08:56 AM UTC
**Owner:** Igor Bondarenko


Improve our binary/text file detection.

[Here is an example](https://sourceforge.net/p/planetexpress/git/ci/ba49bf3d9b3185ea2b0dc5cb6f7a3f8a6781f0c4/) 
of a JPG with a ".d" extension that made it through the **has_html_view** 
function (`allura.model.repository.Blob#has_html_view`).


Performance should be a primary consideration because of the large number of 
calls on bigger commits.

