On Jun 23, "H. Turgut Uyar" <[EMAIL PROTECTED]> wrote:

> A new class would probably be cleaner but at the moment I can't
> see how it should interact with the other components.

In my opinion, it shouldn't: let's create a brand-new class and
have the two kinds of parsers (SAX and DOM) happily live side by
side, with no overlap at all.
Once a DOM parser is written, we'll start using it exclusively.

> If you can layout the skeleton for such a class, I'd be happy to
> fill in the methods :-)

I'll try to work on this next week (first I have to fix some
problems with the old parsers).

> What do you say, which one should I look at next? Maybe it will
> give me a better idea about what the common parts are between those
> parsers and see what the base parser could implement.

A difficult question. :-)
Maybe a good candidate is the movieParser.HTMLOfficialsitesParser
class: it handles 7 different pages for movies (and another one
for persons); the only thing that changes is the key in the
returned dictionary.
E.g.:
  {'official sites': [list, of, official, sites]}
  {'external reviews': [list, of, external, reviews]}
  ...

After I've deployed the new class layout, I think you can start
with that: it should be a good test-bed for a very simple yet
generic parser.
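
To make the "only the key changes" idea concrete, here is a rough
sketch of such a generic parser. It is not the existing
HTMLOfficialsitesParser code: the class name KeyedListParser, its
parse() method and the <li>-based page markup are all assumptions made
up for illustration, and it uses the standard library's html.parser
rather than lxml or BeautifulSoup.

```python
from html.parser import HTMLParser


class KeyedListParser(HTMLParser):
    """Hypothetical sketch: collect the text of every <li> item and
    return the list under a caller-supplied dictionary key."""

    def __init__(self, key):
        super().__init__(convert_charrefs=True)
        self.key = key        # e.g. 'official sites', 'external reviews'
        self._items = []
        self._buffer = None   # None means "not inside an <li>"

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a list item.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'li' and self._buffer is not None:
            text = ''.join(self._buffer).strip()
            if text:
                self._items.append(text)
            self._buffer = None

    def parse(self, html):
        # Reset state so one instance can be reused for several pages.
        self._items = []
        self._buffer = None
        self.reset()
        self.feed(html)
        self.close()
        return {self.key: list(self._items)}
```

With this layout, one instance per page type would be enough, e.g.
KeyedListParser('official sites') and
KeyedListParser('external reviews'); the parsing logic itself is
shared.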

> There is one glitch I couldn't solve: the lxml parser handles
> unicode as the existing (sax) parser does but the beautifulsoup
> parser uses entities at some places. For example, if you search for
> "a better tomorrow", the last entry becomes:
>   "Yesterday, Today &#38; Tomorrow (1986)"

I see; I can think of two solutions (to be applied only using BS):
1. convert, with a regular expression, every entity before the string
   is passed to BS (that is, if it's possible to tell BS _not_ to
   convert them back to entities)
2. iterate over the returned dictionary, searching for unicode strings
   and replacing the contained entities in place.

But I think this is not an urgent issue (though it must be fixed
before a release).
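
A minimal sketch of the second approach, under a few assumptions: the
helper name unescape_entities is hypothetical, it relies on the
standard library's html.unescape (Python 3.4+) instead of a
hand-written regular expression, and it returns a new structure rather
than modifying the dictionary in place. The movie ID in the usage note
is invented.

```python
from html import unescape


def unescape_entities(value):
    """Recursively replace HTML entities in every string found in a
    nested structure of dicts, lists and tuples.

    Dictionary keys and non-string values are left untouched.
    """
    if isinstance(value, str):
        return unescape(value)
    if isinstance(value, dict):
        return {k: unescape_entities(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the container with the same type it had.
        return type(value)(unescape_entities(v) for v in value)
    return value
```

For example, calling it on
{'data': [('0123456', 'Yesterday, Today &#38; Tomorrow (1986)')]}
should give back the same structure with the title turned into
'Yesterday, Today & Tomorrow (1986)'.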


Thanks!
-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

