Calibre uses "pdftohtml" to convert PDF files into other formats. Older versions of "pdftohtml" produce incorrect output around surrogate pair characters, which causes the Python lxml library to choke.
To solve this problem, use a newer "pdftohtml": install the newest "poppler-utils" package (0.85.0-2) from Debian unstable. -- YOKOTA
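
To check which "pdftohtml" is installed before converting, something like the following rough sketch may help. It assumes that "pdftohtml -v" prints a line of the form "pdftohtml version X.Y.Z" (as poppler's build does), and it treats 0.85.0 as new enough only because that is the version named above; the exact release that fixed the problem is not stated here. The function names are made up for illustration.

```python
import re
import shutil
import subprocess

# Assumption: 0.85.0 (the version named in the note) is new enough.
MIN_VERSION = (0, 85, 0)


def parse_version(text):
    """Return (major, minor, patch) from `pdftohtml -v` output, or None."""
    m = re.search(r"pdftohtml version (\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(g) for g in m.groups()) if m else None


def installed_version():
    """Run `pdftohtml -v` and parse its version, or None if unavailable."""
    exe = shutil.which("pdftohtml")
    if exe is None:
        return None
    proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
    # The version banner may go to stderr, so look at both streams.
    return parse_version(proc.stdout + proc.stderr)


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("pdftohtml not found, or its version could not be parsed")
    elif version < MIN_VERSION:
        print("pdftohtml %d.%d.%d is older than the version named above" % version)
    else:
        print("pdftohtml %d.%d.%d should be recent enough" % version)
```

If the reported version is older, upgrading "poppler-utils" as described above is the fix; the script itself only reports and changes nothing.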

