Hey guys, first of all, I've been an active member in #qt and I appreciate all the help I've received from the guys in there. I started Qt a couple of months ago for a new project and I could not have got this far without everyone's help. I am very grateful.
I'm hoping for a bit of code review to make sure I am doing UTF-8 right. I am developing a CSV fixer that lets the user select the encoding of the file so we can correctly handle encoding errors. There is also an option to skip this, in which case I just remove the fields that have encoding errors. I know encoding is almost an impossible problem, but hopefully I can be told otherwise.

My logic is as follows. At the beginning, I read the source CSV file, counting the rows with invalid UTF-8 characters using my bool isUTF8() shown below. I collect either 200 rows with encoding errors or all of the encoding errors in the file, whichever comes first. These rows are then shown to the user on a screen that lets them pick the encoding of the file, previewing the data as they choose.

Once this is done, the file is read again from the beginning using the selected encoding. That is, I read each row from the CSV file, encode the string using the encoding the user selected, and attempt to write it to the output file. Any rows that still have encoding errors (non-UTF-8 data) under the selected encoding are written to an encoding errors file. At the end of parsing, the user can choose the encoding of those items, repeating until all items are written to the output file.

This implementation seems to correctly ensure that only valid UTF-8 is written to the file, but most of the data with encoding errors just gets replaced by a replacement character, which is itself valid UTF-8, defeating the purpose. If I skip this step and choose to remove all fields with encoding errors, it seems to work perfectly, except on Windows, where badly encoded characters are replaced with '?'.

Anyway, here is the implementation of my string encoder. This function gets the std::string row of the .csv file.

    inline QString encode_string( std::string str, std::string encoding )
    {
        QByteArray encoded;

        if ( encoding == UTF8 ) {
            encoded = QString::fromStdString( str ).toUtf8();
        } else if ( encoding == ISO88591 ) {
            QTextCodec *codec = QTextCodec::codecForName( ISO88591 );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == ISO88592 ) {
            QTextCodec *codec = QTextCodec::codecForName( ISO88592 );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == WINDOWS1251 ) {
            QTextCodec *codec = QTextCodec::codecForName( WINDOWS1251 );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == WINDOWS1252 ) {
            QTextCodec *codec = QTextCodec::codecForName( WINDOWS1252 );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == SHIFTJIS ) {
            QTextCodec *codec = QTextCodec::codecForName( SHIFTJIS );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == EUCKR ) {
            QTextCodec *codec = QTextCodec::codecForName( EUCKR );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else if ( encoding == EUCJP ) {
            QTextCodec *codec = QTextCodec::codecForName( EUCJP );
            QByteArray enc( str.c_str(), str.length() );
            return QString( codec->toUnicode( enc ) );
        } else {
            qDebug() << Q_FUNC_INFO << "Hit bad encoding case.";
        }

        // Reached by the UTF8 branch and by the bad-encoding fallback
        // (where encoded is still empty), so every path returns a value.
        return QString( encoded );
    }

Here is the implementation of my valid UTF-8 checker:

    bool Parser::isUTF8( std::string string )
    {
        QString utf8str = QString::fromUtf8( string.c_str() );

        for ( int i = 0; i < utf8str.length(); i++ ) {
            // -3 converts to QChar 0xFFFD, the Unicode replacement character
            if ( utf8str.at( i ) == -3 ) {
                return false;
            }
        }

        return true;
    }

And here is my call point:

    // Write utf8 version of the string
    QString encoded = util::encode_string( joined, this->encoding );

    // If the encoded string has UTF8 errors, write it to the encode error file
    if ( !isUTF8( encoded.toStdString() ) ) {
        QFile encode( this->encodeErrFileName );
        encode.open( QIODevice::ReadWrite | QIODevice::Append );
        QTextStream encodeOut( &encode );
        encodeOut << encoded << "\r\n";
        encode.close();
    } else {
        emit cleanRow();
        tmpFileWriter << encoded << "\r\n";
    }

    output.close();

My function calls go like this:

    1. Read the file as standard (C++ ifstream, no encoding done).
    2. Encode the string (using util::encode_string).
    3. Check whether the encoded string is valid UTF-8 (using bool isUTF8(str)).
    4. If true, write to the output file; if false, write to the encoding errors file.
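
For comparison, here is a minimal sketch of the initial scan done on the raw bytes, before anything has gone through QString. It is not what my code above does, and the names is_valid_utf8 and collect_invalid_rows are hypothetical helpers I made up for illustration. It uses only the standard library and assumes the rows have already been read into std::string without any conversion. The reason I am wondering about this approach is that U+FFFD, once written out, is itself valid UTF-8, so a check that runs after the conversion may not be able to see the original error.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strict UTF-8 check on raw bytes (no Qt involved).
// Rejects invalid lead bytes, stray or missing continuation bytes,
// truncated sequences, overlong forms, UTF-16 surrogates and
// code points above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;       // total length of this sequence in bytes
        unsigned long cp = 0;      // code point decoded so far
        unsigned long min_cp = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;                       // continuation byte or 0xF8..0xFF as lead
        if (i + len > n) return false;           // sequence cut off at end of row
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical helper: return the first max_rows rows that are not valid
// UTF-8, stopping early once the cap is reached (200 in my case).
std::vector<std::string> collect_invalid_rows(const std::vector<std::string>& rows,
                                              std::size_t max_rows = 200) {
    std::vector<std::string> invalid;
    for (const std::string& row : rows) {
        if (invalid.size() >= max_rows) break;
        if (!is_valid_utf8(row)) invalid.push_back(row);
    }
    return invalid;
}
```

With something like this, a row containing a lone 0xE9 byte (Latin-1 "é") would be flagged, while a row that already contains the three bytes EF BF BD (U+FFFD) would pass as valid. If that is right, the rows would need to be checked before they are converted, not after.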