> 
> I'm more and more impressed. :-)
> I really hope we don't hit some show-stopper problem, in the future:
> this approach is almost amazing.
> If I understand correctly, all that is needed to parse search results,
> are the feed_dom method and the entries in the _k dictionary.
> 

For the moment, yes. The ParserBase class will use the feed_dom method 
if there is one and fall back to the SGML parser otherwise.
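To make the dispatch concrete, here is a minimal sketch of how that fallback could look. Only feed_dom and ParserBase are names from this thread; everything else (parse, _feed_sgml, the return shape) is illustrative, not IMDbPy's actual API:

```python
# Hypothetical sketch: the base class prefers a subclass-provided
# feed_dom() and falls back to the legacy SGML machinery otherwise.
class ParserBase:
    def parse(self, html_string):
        feed_dom = getattr(self, 'feed_dom', None)
        if feed_dom is not None:
            # New path: the subclass parses a DOM built from the page.
            return feed_dom(html_string)
        # Old path: fall back to the sgmllib-based parser.
        return self._feed_sgml(html_string)

    def _feed_sgml(self, html_string):
        # Placeholder standing in for the legacy SGML parsing code.
        return {'data': {}}

class SearchMovieParser(ParserBase):
    def feed_dom(self, html_string):
        # A real implementation would build a DOM and run XPath queries;
        # this stub just shows that the new path is taken.
        return {'data': {'title': html_string.strip()}}
```

The point of the getattr check is that existing subclasses keep working unchanged until they grow a feed_dom method.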

> The first thing is to decide the overall structure for the parsers.
> I think it's better to separate it from the ParserBase class, and
> move the "engine" of the new parser into another class (or something).
> Once a replacement is written, the old code can be removed altogether
> (or temporarily left in place for tests).
> 

Since I'm not very familiar with the code, I've tried to keep the 
existing API in place. A new class would probably be cleaner, but at the 
moment I can't see how it should interact with the other components. If 
you can lay out the skeleton for such a class, I'd be happy to fill in 
the methods :-)
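Just to have something concrete to argue about, a skeleton for such an engine class might look like the sketch below. All names here are guesses for discussion, not existing IMDbPy identifiers, and the real interface would have to match whatever calls ParserBase today:

```python
# Rough skeleton of a separate DOM-based "engine" class (hypothetical
# names throughout; only ParserBase and feed_dom come from the thread).
class DOMParserBase:
    """Engine for the new DOM/XPath-based parsers."""

    def parse(self, html_string):
        # Fixed pipeline: build a tree, extract data, post-process.
        dom = self.get_dom(html_string)
        data = self.parse_dom(dom)
        return self.postprocess(data)

    def get_dom(self, html_string):
        """Build a DOM tree (e.g. with lxml or BeautifulSoup)."""
        raise NotImplementedError

    def parse_dom(self, dom):
        """Extract the raw data; subclasses override this."""
        raise NotImplementedError

    def postprocess(self, data):
        """Hook where names/titlesRefs handling could be added later."""
        return {'data': data}
```

The idea is that the base class owns only the pipeline and the refs hook, so filling in a new parser means overriding get_dom and parse_dom and nothing else.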

In order to make sure that it is the XPath parser that is really doing 
the work, and that things don't fall back to the old method, in the 
third patch I've removed the attributes and methods that are responsible 
for handling SAX operations (like _in_table, end_td etc.). But now I'm 
not sure that was a good idea. I think I should leave it to you to 
determine which methods have become unnecessary.

> The requirements are the same:
> - process transparently the names/titlesRefs (I can add it later,
>   there is no need to write the code right now; it's enough if it
>   can be added easily)
> - get a unicode [1] string as input, parse it according to some rules
>   (the ones defined in the feed_dom method, actually) and return
>   a set of dictionaries with the data and names/titlesRefs.
> 
> In your opinion what's the best design?  A complete replica of
> the actual ParserBase class, leaving to the feed_dom of the subclasses
> the parsing work?

Like I said, I have to understand the existing parsers better. 
I've only examined the search movie/person/character/company parsers; I 
still have to look at the individual movie/person/whatever parsers. 
Which one do you suggest I look at next? It might give me a better idea 
of what the common parts between those parsers are, and of what the 
base parser could implement.

> I'm sure that some (many?) complex parsers will require some (a lot
> of?) code to be written, but there are also many simpler ones.
> Do you think these can be handled by a generic "feed_dom" code,
> using a set of provided parameters (like the one you've put in
> the _k dictionary of the HTMLSearchMovieParser class)?
> 

A generic feed_dom could be aiming a little higher than necessary at 
the moment, but I think we could get there in a few iterations. Then 
again, it could also turn out to be less complicated than it looks. I 
can give a better opinion after I've seen the other parsers.
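For the simple list-of-results pages, a generic, parameter-driven extractor in the spirit of the _k dictionary could be as small as the sketch below. This uses the stdlib ElementTree (with its limited path syntax) purely for illustration; the real parsers would use lxml's fuller XPath, and the rule format here is my invention:

```python
# Hypothetical generic extractor: a table mapping output keys to path
# expressions, similar in spirit to the _k dictionary mentioned above.
import xml.etree.ElementTree as ET

def generic_feed_dom(xml_string, rules):
    """rules maps an output key to an ElementTree path expression."""
    root = ET.fromstring(xml_string)
    result = {}
    for key, path in rules.items():
        # Collect the text of every element matching the expression.
        result[key] = [el.text for el in root.findall(path)]
    return result
```

A parser that fits this mold would then just declare its rules, e.g. {'titles': './/title'}, and inherit everything else; only the genuinely complex pages would need hand-written code.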

There is one glitch I couldn't solve: the lxml parser handles unicode 
the same way the existing (SAX) parser does, but the BeautifulSoup 
parser leaves entities unresolved in some places. For example, if you 
search for "a better tomorrow", the last entry becomes:
  "Yesterday, Today &amp; Tomorrow (1986)"

Turgut


_______________________________________________
Imdbpy-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
