On Mon, Aug 10, 2009 at 8:26 AM, Joel Nothman <[email protected]> wrote:
>
>
> Hi,
>
> I am trying to promote the use of mwlib in applications which process
> Wikipedia, Wiktionary, etc. for natural language processing purposes (e.g.
> it is used in my papers at http://www.joelnothman.com/research/).
>
> Python is a popular language for NLP (e.g. http://www.nltk.org/);
> Wikipedia is lately a very important source of language and world
> knowledge; and mwlib can provide a fairly accurate parse-tree of MW
> markup, while quality assurance is left to PediaPress, so we researchers
> can focus on language technology (and occasionally push back to the mwlib
> tip).
>
> NLP researchers would mostly want to use mwlib as a parser, and then
> process/extract elements in the parse tree, or convert it to cleaned
> paragraphs of text.
>
> The fact that people want to use mwlib for things other than publishing
> books means that the API needs to be kept clean and fairly stable. It
> would be nice to get occasional changelogs so that we know when to update
> our working copies.
>
> Unfortunately, things in the API seem to be getting messier. In an old
> checkout of mwlib, I can do:
>
> >>> from mwlib import uparser
> >>> dir(uparser.simpleparse(''))
> Article 'unknown': 0 children
> ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
> '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
> '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
> '__setattr__', '__str__', '__weakref__', '_asText', 'allchildren',
> 'append', 'asText', 'caption', 'children', 'filter', 'find', 'hasContent',
> 'show']
>
> At the tip, I now get:
> ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
> '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
> '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
> '__setattr__', '__str__', '__weakref__', '_asText', '_get_text',
> '_set_text', '_text', 'align', 'allchildren', 'asText', 'blocknode',
> 'caption', 'children', 'filter', 'find', 'frame', 'interwiki',
> 'join_as_text', 'langlink', 'len', 'level', 'lineprefix', 'namespace',
> 'ns', 'rawtagname', 'show', 'source', 'start', 't_2box_close',
> 't_2box_open', 't_begin_table', 't_begintable', 't_break', 't_colon',
> 't_column', 't_comment', 't_complex_article', 't_complex_caption',
> 't_complex_compat', 't_complex_indent', 't_complex_line',
> 't_complex_link', 't_complex_named_url', 't_complex_node',
> 't_complex_preformatted', 't_complex_section', 't_complex_style',
> 't_complex_table', 't_complex_table_cell', 't_complex_table_row',
> 't_complex_tag', 't_end', 't_end_table', 't_endsection', 't_endtable',
> 't_entity', 't_hrule', 't_html_tag', 't_html_tag_end', 't_http_url',
> 't_item', 't_magicword', 't_newline', 't_pre', 't_row', 't_section',
> 't_section_end', 't_semicolon', 't_singlequote', 't_special',
> 't_tablecaption', 't_text', 't_uniq', 't_urllink', 't_vlist', 'tagname',
> 'target', 'text', 'thumb', 'token2name', 'type', 'vlist']
>
> Why does every parse tree node need all these attributes? Can we clean
> this up a little to make mwlib parse trees simpler to work with?
>

Well, these are mostly class attributes so it shouldn't matter.
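
To illustrate the point about class attributes, here is a minimal sketch. The `Node` class and the two helper functions are hypothetical stand-ins, not mwlib's actual code; they only show that `dir()` reports names defined on the class as well as on the instance, while `vars()` reports only what is stored on the instance itself.

```python
class Node(object):
    # Hypothetical stand-in for a parse tree node (not mwlib's real class).
    # Class-level attributes: stored once on the class, shared by all nodes.
    t_text = 1
    t_newline = 2
    caption = ''

    def __init__(self):
        # Instance attribute: stored separately on every node.
        self.children = []


def instance_attrs(obj):
    """Return the names stored on the object itself (its own __dict__)."""
    return sorted(vars(obj))


def public_attrs(obj):
    """Return every public name dir() reports, class attributes included."""
    return sorted(name for name in dir(obj) if not name.startswith('_'))


node = Node()
print(instance_attrs(node))  # only the per-instance names
print(public_attrs(node))    # per-instance names plus class-level names
```

Under these assumptions, a long `dir()` listing says little about how much each node actually carries; comparing it against `vars(node)` separates per-node state from names inherited from the class.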

What's worse is that we currently have three different parse tree
implementations, which we'll probably merge (changing the API yet again).

You'll probably have to wait for a 1.0 version if you want API stability, sorry.

- Ralf
