Re: Review Request: Don't hang when determining MIME type of corrupted files

Miroslav Ľos Wed, 19 Oct 2011 00:34:43 -0700


> On Aug. 21, 2011, 10:07 a.m., David Faure wrote:
> > Thanks Peter and Miroslav. The analysis looks correct, the pre-read part of 
> > the patch looks good. I'm just wondering about using Unbuffered. If someone 
> > installs a mimetype definition with multiple rules trying to match some 
> > bytes after the 2K limit, then all this seeking-and-reading back and forth 
> > will be very slow, in unbuffered mode (since neither cache will be used).
> 
> Miroslav Ľos wrote:
>     I find Unbuffered causing slowness improbable. From what I've seen in 
> uses of KMimeType::findBy\w*Content in kdelibs, they all (e.g. ftp* and http 
> kioslaves) rather provide their own buffer in a QByteArray rather than a 
> QIODevice; most just provide the path (which is only opened using QFile if it 
> is_local_file).
>     
>     All QFile's buffering is implemented in the QIODevice superclass, it adds 
> Unbuffered in open() to its openMode for its fileEngine() backend. Thus, no 
> buffering is propagated down. The unnecessary 16K read did hit several more 
> EIO's on the broken CD I have, but it is just cosmetic I guess.
>     
>     Nonetheless, I find most uses of these functions ever will be through 
> QFile or QBuffer and any client passing another QIODevice may open it itself. 
> Maybe a note in the documentation could help them do it if necessary.
>     
>     Finally, I wonder if buffering makes a difference as we are only using a 
> few small block reads, not many getChar()'s. Then again, the documentation to 
> QTcpSocket says they cannot be opened (i.e. ignore I guess) Unbuffered, which 
> is what any other QIODevices may do as well.
>     
>     * ftp passes a 1K-capped buffer (kioslave/ftp/ftp.cpp:2471). That may be 
> insufficient for the single rule on my system that needed 1029 bytes.
> 
> David Faure wrote:
>     (Pasting here, after realizing you probably didn't see my reply on 
> kde-core-devel)
>     
>     > I find Unbuffered causing slowness improbable. From what I've seen in 
> uses
>     > of KMimeType::findBy\w*Content in kdelibs, they all (e.g. ftp* and http
>     > kioslaves) rather provide their own buffer in a QByteArray rather than a
>     > QIODevice; most just provide the path (which is only opened using QFile 
> if
>     > it is_local_file).
>     
>     The most common case for mimetype determination is from the file manager 
>     listing local files, in which case findByUrl will use a QFile for content-
>     determination.
>     
>     Ah, and BTW I have just found a magic rule that needs 4K of data:
>     vmware-player.xml says:
>       <match type="string" value='config.version = "' offset="0:4096"/>
>     
>     > All QFile's buffering is implemented in the QIODevice superclass, it 
> adds
>     > Unbuffered in open() to its openMode for its fileEngine() backend. 
> Thus, no
>     > buffering is propagated down.
>     
>     I'm not sure what you mean there.
>     
>     > The unnecessary 16K read did hit several more
>     > EIO's on the broken CD I have, but it is just cosmetic I guess.
>     
>     Yes I'm not sure it's worth optimizing for this special case.
>     
>     > Nonetheless, I find most uses of these functions ever will be through 
> QFile
>     > or QBuffer
>     
>     Sure. It's QFile that I have in mind here, when I say that seeking back 
> and 
>     forth will be slow, in unbuffered mode.
>     
>     > Finally, I wonder if buffering makes a difference as we are only using 
> a few
>     > small block reads, not many getChar()'s.
>     
>     It's about seeking and reading, vs just having the data in memory.
>     
>     > * ftp passes a 1K-capped buffer (kioslave/ftp/ftp.cpp:2471). That may be
>     > insufficient for the single rule on my system that needed 1029 bytes.
>     
>     Right. Small bug, but a corner case (only matters if extension unknown).
>
> 
> Christoph Feck wrote:
>     Miroslav, can you clarify why Unbuffered is beneficial? I did not 
> understand it from the ongoing discussion, and I have the feeling David did 
> not either :)
> 
> Miroslav Ľos wrote:
>     Well, apparently I failed to actually publish my previous reply. My 
> apologies. So here it is, with some additions:
>     
>     >> All QFile's buffering is implemented in the QIODevice superclass, it 
> adds
>     >> Unbuffered in open() to its openMode for its fileEngine() backend. 
> Thus, no
>     >> buffering is propagated down.
>     
>     >I'm not sure what you mean there.
>     
>     I mean this excerpt from QFile::open in 
> qt-everywhere-opensource-src-4.7.3/src/corelib/io/qfile.cpp:
>        995      // QIODevice provides the buffering, so there's no need to 
> request it from the file engine.
>        996      if (fileEngine()->open(mode | QIODevice::Unbuffered)) {
>        997          QIODevice::open(mode);
>        998          if (mode & Append)
>        999              seek(size());
>       1000          return true;
>       1001      }
>     
>     That means the only thing Unbuffered controls in a QFile (i.e. opened 
> with open(OpenMode)) is the use of a buffer provided in its superclass. Every 
> time it needs to read some data from the device, it first tries to fill the 
> buffer (16K by default). This is only of advantage if the file is read in 
> multiple smaller consecutive blocks in sequence. Most importantly, if there 
> is a seek() to a new position, the whole 16K is read anew, unless the 
> position sought is inside the buffer (duh) and after the current position - 
> no seeking back, even to the start of previous read. This means that opening 
> the QFile Unbuffered is more efficient overall, even if all reads succeed, 
> more so if there are read errors.
>     
>     The seeking back behavior is better explained by this excerpt from 
> qiodevice.cpp, in seek():
>        649        if (offset < 0
>        650                || offset >= qint64(d->buffer.size()))
>        651            // When seeking backwards, an operation that is only 
> allowed for
>        652            // random-access devices, the buffer is cleared. The 
> next read
>        653            // operation will then refill the buffer. We can 
> optimize this, if we
>        654            // find that seeking backwards becomes a significant 
> performance hit.
>        655            d->buffer.clear();
>        656        else if (!d->buffer.isEmpty())
>        657            d->buffer.skip(int(offset));
>     
>     To demonstrate (rather verbosely I fear), below is the actual relevant 
> behavior on an uncorrupted file A) before patching; B) my patch with 
> buffering left intact and C) my patch.
>     
>     In case A), the 16K is read whenever a bigger chunk of data is needed for 
> a rule (it is retained at kmimemagicrule.cpp:134 iff the read was from the 
> beginning). If the file were corrupted, and only could read e.g. first 8K of 
> data, every read() would result in the drive trying to read the data until 
> timeout (first of each pair of reads would return 8192, second -1 EIO, both 
> hanging), which is the source of the hangs I was trying to solve. If the file 
> weren't readable at all, there would still be a pair of reads, one for the 
> buffer, one for the actual size requested, both returning -1 EIO.
>     
>     In case B) you prefer, the situation is much better: the first big buffer 
> read reduces the need for subsequent re-reads. It shows there are rules after 
> 2K on my system after all. It also shows the limited case where buffering may 
> help (VERY little): there is a rule starting at byte 2089; no rule touched 
> that part of the buffer (containg the whole file as of the first read), so no 
> new read is made. From case C) you see it needs 10 bytes, setting the 
> internal buffer pointer (IBP) to byte 2099. Another rule wants 9 bytes from 
> 2112, still satisfied by the buffer (IBP at position 2121 now). Yet another 
> rule needs 9 bytes from 2108 - before IBP: this results in a new 16K read 
> into the buffer from position 2108, replacing the previous contents in it. 
> The IBP is 2117, while the next rule needs 28 bytes back from 2112 again: 
> another 16K read.
>     
>     If you still prefer this solution, then I suggest the whole 16K (more 
> exactly, it is QIODEVICE_BUFFERSIZE, but it is an internal constant in 
> qiodevice_p.h) of the initial buffered read is used in the 'beginning' 
> variable.
>     
>     In case C), only the data actually requested is ever read. It also wins 
> on a lesser count of read() calls over B), as it always verifies it reached 
> EOF if it does not fill the buffer. 
>     
>     Now that I think of it, I guess the filesystem/kernel does buffered reads 
> anyway, so the only difference between cases B) and C) may be that B) just 
> needlessly uses one more buffer.
>     
>     If I managed to convince you about Unbuffered, please fix coding style 
> with spaces around the | operator. 
>     
>     === A ===
>     open("k", O_RDONLY|O_CLOEXEC)           = 3
>     fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     lseek(3, 2089, SEEK_SET)                = 2089
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     read(3, "_ostreamIcT_ES5_PKc\0_ZSt4cout\0_Z"..., 16384) = 11625
>     read(3, "", 4759)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 2112, SEEK_SET)                = 2112
>     lseek(3, 2108, SEEK_SET)                = 2108
>     read(3, "\0_ZSt4cout\0_ZNSt8ios_base4InitC1"..., 16384) = 11606
>     read(3, "", 4778)                       = 0
>     lseek(3, 2112, SEEK_SET)                = 2112
>     read(3, "t4cout\0_ZNSt8ios_base4InitC1Ev\0l"..., 16384) = 11602
>     read(3, "", 4782)                       = 0
>     lseek(3, 546, SEEK_SET)                 = 546
>     read(3, 
> "\0\0\0\0\0\0@\2\0\0\0\0\0\0\1\0\0\0\0\0\0\0\200\25\4e\0(\0\0\0\0"..., 16384) 
> = 13168
>     read(3, "", 3216)                       = 0
>     lseek(3, 546, SEEK_SET)                 = 546
>     read(3, 
> "\0\0\0\0\0\0@\2\0\0\0\0\0\0\1\0\0\0\0\0\0\0\200\25\4e\0(\0\0\0\0"..., 16384) 
> = 13168
>     read(3, "", 3216)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 2080, SEEK_SET)                = 2080
>     close(3) 
>     
>     === B ===
>     open("k", O_RDONLY|O_CLOEXEC)           = 3
>     fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 16384) 
> = 13714
>     read(3, "", 2670)                       = 0
>     lseek(3, 2089, SEEK_SET)                = 2089
>     lseek(3, 2112, SEEK_SET)                = 2112
>     lseek(3, 2108, SEEK_SET)                = 2108
>     read(3, "\0_ZSt4cout\0_ZNSt8ios_base4InitC1"..., 16384) = 11606
>     read(3, "", 4778)                       = 0
>     lseek(3, 2112, SEEK_SET)                = 2112
>     read(3, "t4cout\0_ZNSt8ios_base4InitC1Ev\0l"..., 16384) = 11602
>     read(3, "", 4782)                       = 0
>     lseek(3, 2080, SEEK_SET)                = 2080
>     read(3, "St13basic_ostreamIcT_ES5_PKc\0_ZS"..., 16384) = 11634
>     read(3, "", 4750)                       = 0
>     close(3)
>     
>     === C ===
>     open("k", O_RDONLY|O_CLOEXEC)           = 3
>     fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     lseek(3, 0, SEEK_SET)                   = 0
>     fstat(3, {st_mode=S_IFREG|0755, st_size=13714, ...}) = 0
>     read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\240\f@\0\0\0\0\0"..., 2048) = 
> 2048
>     lseek(3, 2089, SEEK_SET)                = 2089
>     read(3, "_ostreamIc", 10)               = 10
>     lseek(3, 2112, SEEK_SET)                = 2112
>     read(3, "t4cout\0_Z", 9)                = 9
>     lseek(3, 2108, SEEK_SET)                = 2108
>     read(3, "\0_ZSt4cou", 9)                = 9
>     lseek(3, 2112, SEEK_SET)                = 2112
>     read(3, "t4cout\0_ZNSt8ios_base4InitC1", 28) = 28
>     lseek(3, 2080, SEEK_SET)                = 2080
>     read(3, "St13basic_ostreamIcT_ES5_PKc\0", 29) = 29
>     close(3)
>
> 
> David Faure wrote:
>     Excellent investigation.
>     Amazing that QIODevice/QFile is being that stupid with "seeking back", no 
> reason to clear the buffer if the buffer contains the data for the position 
> we want to seek back to!
>     After your analysis of that QIODevice::seek code, do you think it could 
> easily be fixed to not do that, BTW? It looks like it should look at where 
> the buffer started from, and compare -that- with "pos", rather than comparing 
> the current and previous positions.
>     
>     One thing I didn't understand though, is what really happens with 
> Unbuffered. There's no buffering in QIODevice, but there's still buffering in 
> QFileEngine? That sounds like another bug to me, for the case of files which 
> could change at any point in time (e.g. in /proc). I would expect Unbuffered 
> to not be buffered (!), this is why I'm very hesitant to use that "because it 
> works better with the current implementation in Qt"; it sounds like something 
> that could change at some point, while respecting the documented behavior / 
> usual expectations for Unbuffered.
>     
>     Over all, I think this should be fixed in QIODevice, for all cases where 
> some code reads back and forth in a file (I'm soon going to have the same 
> problem with mime.cache binary file parsing which includes a lot of seeking).
>     
>     But of course we can increase the read buffer size in kmimetype in any 
> case, e.g. to 16K. Can you provide a patch for that?
>     
>

I actually wrote a patch for seeking back, but it is of only limited use. 
Seeking back in the buffer is obviously disabled in ReadWrite mode, nobody 
could need that. It also handles ungetChar and peek+ungetBlock, both of which 
can enlarge the buffer more than the prepended data (to allow more ungetChar 
calls without constant resizing), so that the valid data may not be from the 
start of the buffer. Seeking back into the buffer after it was drained would 
require nontrivial changes in the semantics of QIODevicePrivateLinearBuffer (to 
retain which part of the file it holds) and then all over QIODevice (it really 
wasn't designed to allow that).

I uploaded the patch for this limited seeking back to 
http://paste.kde.org/134341/. My system still works, but it certainly isn't a 
sufficient test of all possible code paths and call combinations there may be 
(and rarely are).

In my previous comment I was talking about the generic buffering and filesystem 
caches in the kernel and FS modules that I only conjecture are always used in 
normal filesystems, i.e. not with procfs and the like. These buffers are surely 
not disabled for all Qt applications, no matter that QFile always adds 
Unbuffered to its file engine's open mode, so it is actually never buffered as 
far as that parameter is concerned. I looked deeper into the code; an 
Unbuffered QFSFileEngine opens using open(2), a buffered using fopen(3); fopen 
uses additional user-space buffers. I believe those buffers would do a much 
better job than QIODevice itself. Now I'd rather QFile added Unbuffered to its 
own mode than the file engine's (bummer QIODevice::openMode() isn't virtual, 
then it could be done cleanly like http://paste.kde.org/135283/).

Returning to our concrete use case (and your future one also), I think 
supporting (this level of) seeking back will not help much. Suppose a rule 
matches outside the available data range: in KMimeMagicMatch::match, the file 
is seeked() to the rule's starting position, then a read is made. The buffer 
gets filled with file contents from that position. If the next rule starts just 
a few bytes earlier (similar to what happened with the rules in my previous 
comment, minus the first rule actually starting on the lowest position), the 
buffer is (next to) useless again. If the rule is completely elsewhere, the 
buffer should be filled with data from this new location, dropping its previous 
content completely.

In closing, short of some ordering of the match rules by starting positions 
(then there's no need to seek back at all), using the QIODevice's buffer is 
rarely helpful and usually worse than useless. To capture most rules, the 
initial buffer can be 16K long. Each other rule will most likely need its own 
read, so it should be the shortest necessary.

I cannot edit this review request, so I pasted the modified patch to 
http://paste.kde.org/134353/. It just changes the initial buffer to 16K and 
adds spaces around operators. All pastes will be gone in a month.

As for mime.cache, I think it is usually a rather small file, so it can be read 
into a buffer whole, or maybe QFile::map can be used to share the memory across 
processes.

- Miroslav

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/102391/#review5872
-----------------------------------------------------------

On Aug. 20, 2011, 5:21 p.m., Peter Penz wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://git.reviewboard.kde.org/r/102391/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2011, 5:21 p.m.)
> 
> 
> Review request for kdelibs and David Faure.
> 
> 
> Description
> -------
> 
> If KMimeTypeRepository::findFromContent() tries to determine MIME from a file 
> that cannot be read, such as on a corrupted optical disc, a read attempt is 
> made in KMimeMagicMatch::match() for every available rule, resulting in UI 
> hangs (e.g. file dialogs, dolphin) for tens of minutes (see 
> https://bugs.kde.org/show_bug.cgi?id=280446 for more details).
> 
> I've submitted this patch here on behalf of Miroslav ?os, who has submitted 
> the bug-report and also has written the patch.
> 
> 
> This addresses bug 280446.
>     http://bugs.kde.org/show_bug.cgi?id=280446
> 
> 
> Diffs
> -----
> 
>   kdecore/services/kmimetype.cpp 955bf62 
>   kdecore/services/kmimetyperepository.cpp 6ff3d16 
> 
> Diff: http://git.reviewboard.kde.org/r/102391/diff/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Peter Penz
> 
>

Re: Review Request: Don't hang when determining MIME type of corrupted files

Reply via email to