On Jun 23, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> A new class would probably be cleaner but at the moment I can't
> see how it should interact with the other components.
In my opinion, it shouldn't: let's create a brand-new class and have
the two kinds of parsers (SAX and DOM) happily live side by side, with
no overlap at all.  Once a DOM parser is written, we'll start using
it exclusively.

> If you can layout the skeleton for such a class, I'd be happy to
> fill in the methods :-)

I'll try to work on this next week (first I have to fix some
problems with the old parsers).

> What do you say, which one should I look at next? Maybe it will
> give me a better idea about what the common parts are between those
> parsers and see what the base parser could implement.

A difficult question. :-)
Maybe a good candidate is the movieParser.HTMLOfficialsitesParser
class: it handles 7 different pages for movies (and another one for
persons); the only thing that changes is the key in the returned
dictionary.  E.g.:
  {'official sites': [list, of, official, sites]}
  {'external reviews': [list, of, external, reviews]}
  ...

After I've deployed the new class layout, I think you can start with
that: it should be a good test-bed for a very simple yet generic
parser.

> There is one glitch I couldn't solve: the lxml parser handles
> unicode as the existing (sax) parser does but the beautifulsoup
> parser uses entities at some places. For example, if you search for
> "a better tomorrow", the last entry becomes:
> "Yesterday, Today & Tomorrow (1986)"

I see; I can think of two solutions (to be applied only when using BS):
1. convert, with a regular expression, every entity before the string
   is passed to BS (that is, if it's possible to tell BS _not_ to
   convert them back to entities)
2. iterate over the returned dictionary, searching for unicode strings
   and replacing the contained entities in place.

But I think this is not an urgent issue (even if it must be fixed
before a release).

Thanks!
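
A rough sketch of solution 2, just to show the idea.  It assumes
Python 3 and uses html.unescape from the standard library; the helper
name unescape_entities and the sample dictionary are made up for
illustration and are not IMDbPY code.  It also assumes the entity left
behind by BS is "&amp;".

```python
import html


def unescape_entities(data):
    """Walk a parsed result and replace HTML entities in every string.

    Dictionaries and lists are updated in place; tuples are rebuilt,
    since they are immutable.  Hypothetical helper, not part of IMDbPY.
    """
    if isinstance(data, str):
        return html.unescape(data)
    if isinstance(data, dict):
        # Iterate over a snapshot of the keys while reassigning values.
        for key in list(data):
            data[key] = unescape_entities(data[key])
        return data
    if isinstance(data, list):
        for index, value in enumerate(data):
            data[index] = unescape_entities(value)
        return data
    if isinstance(data, tuple):
        return tuple(unescape_entities(value) for value in data)
    # Numbers, None and other objects are returned untouched.
    return data


# Made-up search result containing an entity (assumed to be "&amp;").
result = {'data': [('A Better Tomorrow (1986)',),
                   ('Yesterday, Today &amp; Tomorrow (1986)',)]}
unescape_entities(result)
print(result['data'][-1][0])
```

Strings without entities pass through unchanged, so it should be safe
to run this on the output of any parser, not only the BS one.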
-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel