-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://git.reviewboard.kde.org/r/104310/
-----------------------------------------------------------

Review request for Amarok.


Description
-------

Amarok incorrectly scans files with non-ASCII characters in their tags. The
symptom is that some files show two "invalid UTF character" symbols in place
of a single non-ASCII character (it looks like <?><?>, a question mark inside
a black circle). The most visible effect is that some albums end up in Various
Artists because one of the tracks had its artist name corrupted in this way.
The problem is not limited to artist names, though: there are also tracks with
corrupted album names or titles.

The reason for this issue is as follows. When Amarok invokes the collection
scanner process, it receives the results from amarokcollectionscanner over a
pipe. Here is a snippet of code from
src/core-impl/collections/db/ScanManager.cpp:

void
ScannerJob::getScannerOutput()
{
     m_incompleteTagBuffer += m_scanner->readAll();
}

The m_incompleteTagBuffer member is declared in
src/core-impl/collections/db/ScanManager.h:

     QString m_incompleteTagBuffer;

However, m_scanner->readAll() returns a QByteArray, not a QString, so each
chunk is converted to QString on its own when it is appended. This is fine for
ASCII characters (which are 1 byte in UTF-8), but breaks for multibyte
sequences. If readAll() returns a block that ends in the middle of a multibyte
sequence, the conversion to QString in ScannerJob::getScannerOutput replaces
the truncated character with an "invalid UTF character" symbol. The next block
then starts in the middle of the same UTF-8 sequence, so the remainder is
replaced with one more "invalid UTF character" symbol. Thus, a single
multibyte UTF-8 character is replaced with two "invalid UTF character"
symbols.
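
To make the failure mode concrete, here is a minimal sketch in plain C++,
without Qt. The decodeLenient() function is a hypothetical stand-in for the
implicit QByteArray-to-QString conversion; it is not Qt's decoder and only
assumes that malformed or truncated input is replaced with U+FFFD. It shows
that decoding two chunks separately corrupts a two-byte character that
straddles the chunk boundary, while decoding the joined bytes does not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, simplified UTF-8 decoder (not Qt's implementation).
// It emits U+FFFD for every truncated sequence and every stray byte.
// Overlong forms and surrogates are not checked, to keep the sketch short.
std::u32string decodeLenient(const std::string &bytes)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else {
            // Continuation byte with no lead byte, or an invalid lead byte.
            out.push_back(replacement);
            ++i;
            continue;
        }

        // Consume continuation bytes; stop early if the input ends or a
        // byte is not of the form 10xxxxxx.
        std::size_t j = 1;
        for (; j < len && i + j < bytes.size(); ++j) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + j]);
            if ((c & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(j == len ? cp : replacement);
        i += j;
    }
    return out;
}
```

For example, "Bj\xC3\xB6rk" (with U+00F6 encoded as the bytes C3 B6) split
into the chunks "Bj\xC3" and "\xB6rk" decodes per chunk to six code points,
two of them U+FFFD, whereas decoding the concatenated bytes yields the five
expected code points.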

The solution implemented by the attached patch is to store the incomplete data
as a QByteArray and to search the byte stream for the end of a partial
("</directory>") or full ("</scanner>") element before converting anything to
QString. Complete blocks can be safely converted to QString, because every
multibyte character lies entirely inside the XML tags.
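
The buffering idea can be sketched as follows. This is not the attached patch:
it uses std::string in place of QByteArray so that it builds without Qt, and
the class name ScannerOutputBuffer and its methods are hypothetical. Searching
the raw bytes for the ASCII terminators is safe because every byte of a UTF-8
multibyte sequence has its high bit set and so cannot match an ASCII
character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: accumulates raw bytes and hands out only complete
// blocks, i.e. everything up to and including "</directory>" or "</scanner>".
class ScannerOutputBuffer
{
public:
    // Append one raw chunk and return every complete block now available.
    // The returned blocks are still bytes; decode them to text afterwards.
    std::vector<std::string> append(const std::string &chunk)
    {
        m_pending += chunk;
        std::vector<std::string> blocks;
        for (;;) {
            const std::size_t end = firstTerminatorEnd();
            if (end == std::string::npos)
                break;
            blocks.push_back(m_pending.substr(0, end));
            m_pending.erase(0, end);
        }
        return blocks;
    }

    // Bytes received so far that do not yet form a complete block.
    const std::string &pending() const { return m_pending; }

private:
    // Offset just past the earliest terminator in the buffer, or npos.
    std::size_t firstTerminatorEnd() const
    {
        static const std::string terminators[] = { "</directory>", "</scanner>" };
        std::size_t best = std::string::npos;
        for (const std::string &t : terminators) {
            const std::size_t pos = m_pending.find(t);
            if (pos != std::string::npos && pos + t.size() < best)
                best = pos + t.size();
        }
        return best;
    }

    std::string m_pending;
};
```

With this approach a chunk ending in "Bj\xC3" yields no block at all; the
block is only returned, with the character intact, once the chunk containing
"\xB6rk</artist></directory>" has arrived.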


Diffs
-----

  src/core-impl/collections/db/ScanManager.h 5f0d153 
  src/core-impl/collections/db/ScanManager.cpp 97d0b1c 

Diff: http://git.reviewboard.kde.org/r/104310/diff/


Testing
-------


Thanks,

Alexey Neyman
