Brian Wanamaker writes:
> >...despite appearing normal on the web IE6, Firefox, and under OS
> >X/Safari, on my Palm it gets artifacted characters in place of
> >curly-quotes and "real" apostrophes (not double-primes and
> >single-prime, respectively). He says he writes his pieces on a Mac,
> >under Works or Word, which auto-curly-quotes, then cut-and-pastes it
> >into Blogger. It displays fine in browsers, but Plucker displays
> >something with a euro-denomination character in it.

I hit this problem all the time, on lots of different pages (using
the python parser). I've looked for "the right" way to do this in
python, so that I could contribute a fix, but haven't found any python
utf-8-to-ascii translation libraries, or even a comprehensive table of
utf-8 values so that I can write such a library.

So I've written a hack to the python parser to translate the most
common characters I see in web pages, and when I get a page that has a
lot of some new character, I add that character to the list. Not a
good solution, but if you have one page you pluck often, it'll work
for that. See the code at the end of this message.

David A. Desrosiers writes:
> I then tried to pluck it with the Python distiller, passing in
> the right charset, and it fails to parse it properly. I think that
> might be a bug.

For pages on which the distiller doesn't fail, what would be the right
way to do it? For example, here's a page that uses a lot of
three-character sequences for things like ellipsis and emdash. I tried
specifying a charset, but it didn't help. I ran this command:

    plucker-build -H http://riverbendblog.blogspot.com/ -N "Riverbend" -f riverbend --noimages --stayonhost --zlib-compression --maxdepth 1 --charset=utf-8

But the result still displays on the Palm with lots of 3-char
sequences like It<a-hat><Euro><trademark>s instead of It's.

With the following hack, I see It's, no a-hat or euro or TM.

First, around line 882 (just before def add_text) I add this (and this
is where you can add additional sequences for pages you pluck often):

    # "unencode" -- strip out codes the palm won't understand,
    # for things like smartquote characters.
    def unencode (self, line):
        line = string.replace (line, "\342\200\224", "--")
        line = string.replace (line, "\342\200\230", "`")
        line = string.replace (line, "\342\200\231", "'")
        line = string.replace (line, "\342\200\234", "\"")
        line = string.replace (line, "\342\200\235", "\"")
        line = string.replace (line, "\342\200\246", "...")
        return line

Then, a few lines down in add_text, just after the "while 1", add a
call to unencode:

        while 1:
            line = self.unencode(line)
            new_size = self._approximate_size + len (line)

Hope this helps! Sorry it's such a hack; I'd love to hear of a better
and more comprehensive solution.

        ...Akkana
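
A more general version of the same idea can be sketched in a few
lines. This is only an illustration, assuming a current Python 3
standard library rather than the Python the parser was written for,
and it is not part of Plucker: the names PUNCTUATION and to_ascii are
made up for this example. It decodes the UTF-8 bytes first, maps the
same six sequences handled by unencode (now written as Unicode code
points), and then falls back on Unicode decomposition for anything
else. The first lines also show where the a-hat/Euro/trademark
sequence seems to come from: the three UTF-8 bytes of a curly
apostrophe read one byte at a time as Windows-1252.

```python
import unicodedata

# The three UTF-8 bytes of U+2019 (right single quote), misread one byte
# at a time as Windows-1252, give a-hat, Euro sign, trademark sign.
mojibake = b"It\xe2\x80\x99s".decode("cp1252")

# Hypothetical table: the same six characters the unencode hack handles.
# Extend it with whatever the pages you pluck actually contain.
PUNCTUATION = {
    "\u2014": "--",   # em dash            (\342\200\224)
    "\u2018": "`",    # left single quote  (\342\200\230)
    "\u2019": "'",    # right single quote (\342\200\231)
    "\u201c": '"',    # left double quote  (\342\200\234)
    "\u201d": '"',    # right double quote (\342\200\235)
    "\u2026": "...",  # ellipsis           (\342\200\246)
}
_TABLE = str.maketrans(PUNCTUATION)


def to_ascii(raw, charset="utf-8"):
    """Return an ASCII-only approximation of the encoded bytes in raw."""
    # Undecodable bytes become U+FFFD and are dropped in the last step.
    text = raw.decode(charset, "replace")
    # Replace known punctuation with ASCII look-alikes.
    text = text.translate(_TABLE)
    # NFKD splits accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Discard whatever still has no ASCII representation.
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(mojibake)
    print(to_ascii(b"It\xe2\x80\x99s \xe2\x80\x9cfine\xe2\x80\x9d\xe2\x80\xa6"))
```

The last step is lossy on purpose: accented letters lose their
accents and characters with no decomposition disappear, so a table
entry is still the better choice for anything that matters on a given
page.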

