[mwlib] Re: mwlib for NLP, and cleaning up the API

Ralf Schmitt Tue, 11 Aug 2009 00:11:13 -0700

"Joel Nothman" <[email protected]> writes:

>>
>> Well, these are mostly class attributes so it shouldn't matter.
>>
>> What's worse is that we have 3 different parse tree incarnations,
>> which we'll probably merge (again changing the API).
>>
>> You'll probably have to wait for a 1.0 version if you want API  
>> stability, sorry.
>>
>> - Ralf
>
> Yes, I noticed multiple incarnations in your repository after writing my  
> email, and was curious about the aims of mwlib-tidy.
>


mwlib-tidy was/is the branch where I removed mwapidb and some other
obsolete stuff. It's already merged in tip.

mwlib.refine.* is the new parser. It has been the default parser since
March. Before that we had a recursive descent parser. That old parser
used to create instances of different classes (Article, Paragraph,
Style, ...). We also have our advanced tree stuff, where again different
classes are used in the parse tree. E.g. it changes Style nodes to
Italic, Emphasized, Strong,... nodes. If taken to an extreme, this would
create classes for every single html tag...

The other extreme is using a single class to represent all objects in
the parse tree. And that's what mwlib.refine does.  It uses instances of
mwlib.utoken.token to build its syntax tree.
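
To illustrate the single-class approach, here is a minimal sketch. The
Token class below, its attributes (type, text, children) and the walk
helper are hypothetical names made up for this example; they are not
mwlib's actual mwlib.utoken.token API.

```python
class Token(object):
    """Hypothetical single node class: the node kind is data, not a subclass."""

    def __init__(self, type, text="", children=None):
        self.type = type                  # e.g. "article", "paragraph", "style"
        self.text = text                  # raw text covered by this node, if any
        self.children = children or []    # child nodes, all of the same class


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        for sub in walk(child):
            yield sub


# Every node in the tree is a Token; only the .type value differs.
tree = Token("article", children=[
    Token("paragraph", children=[
        Token("style", text="''", children=[Token("text", text="hello")]),
    ]),
])

node_types = [n.type for n in walk(tree)]
node_classes = set(type(n).__name__ for n in walk(tree))
```

Code that processes such a tree dispatches on node.type instead of using
isinstance checks against a class hierarchy.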


> I don't understand why there is a need for so many class variables such as  
> "thumb" and "langlink" to be part of the __dict__ of every node. As
> far as

They are not part of the __dict__ of every node. They belong to the
class and provide reasonably sane defaults. At least we do not have to use
hasattr(...).
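
A minimal sketch of the difference between class attributes and the
instance __dict__. The Token class here is a hypothetical stand-in, not
mwlib's real token class; only the attribute names thumb and langlink
come from the discussion above.

```python
class Token(object):
    # Class-level defaults: stored once on the class, shared by all instances.
    thumb = False
    langlink = None

    def __init__(self, text):
        self.text = text  # the only entry in a fresh instance's __dict__


plain = Token("foo")
link = Token("de:Beispiel")
link.langlink = "de"  # assignment on the instance shadows the class default

plain_attrs = sorted(plain.__dict__)  # only per-instance data
link_attrs = sorted(link.__dict__)    # includes the overridden attribute

# The default is readable on every node without a hasattr() check.
plain_langlink = plain.langlink
```

Since pickle stores the instance __dict__ and not the class attributes,
defaults defined this way should not add to the size of a pickled node
unless they are overridden on that node.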
  
> I can tell, .langlink is only used in __repr__; surely there is a neater  
> way to do this.

It's also used by mwlib.refine.compat:

            elif node.langlink:
                node.__class__ = N.LangLink
                node.namespace = node.target.split(":", 1)[0]

This changes the class of that node to LangLink, making it compatible with
the old parser.
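
A self-contained sketch of that technique (reassigning __class__ on an
existing object). The Node and LangLink classes and the to_compat
function are hypothetical stand-ins for this example, not the actual
mwlib.refine.compat code.

```python
class Node(object):
    langlink = None  # class-level default, as discussed above

    def __init__(self, target):
        self.target = target


class LangLink(Node):
    """Stand-in for the old parser's LangLink node class."""


def to_compat(node):
    # Re-tag the existing object in place instead of building a new one.
    if node.langlink:
        node.__class__ = LangLink
        node.namespace = node.target.split(":", 1)[0]
    return node


node = Node("de:Beispiel")
node.langlink = "de"
to_compat(node)

untouched = to_compat(Node("Some article"))  # no langlink, class stays Node
```

Reassigning __class__ works here because both classes are ordinary
Python classes with compatible instance layouts; the object keeps its
identity and its existing attributes.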

>
> Ideally, I would not only like to see parse tree properties simplified for  
> increased usability, but also that if I wanted to pickle the parse trees,  
> they wouldn't be excessively enormous. (Currently pickled-and-zipped  
> parses of English Wikipedia take up 40GB using an mwlib from last year.)
>

I've never tried to pickle a parse tree. You're on your own here.

> If there's anything I can do to help simplify the parse nodes, I'm willing  
> to help out. But I'm afraid of doing much, precisely because there seem to  
> be too many incarnations at the moment.
>

