On Mon, Aug 10, 2009 at 8:26 AM, Joel Nothman <[email protected]> wrote:
>
> Hi,
>
> I am trying to promote the use of mwlib in applications which process
> Wikipedia, Wiktionary, etc. for natural language processing purposes (e.g.
> it is used in my papers at http://www.joelnothman.com/research/).
>
> Python is a popular language for NLP (e.g. http://www.nltk.org/);
> Wikipedia is lately a very important source of language and world
> knowledge; and mwlib can provide a fairly accurate parse-tree of MW
> markup, while quality assurance is left to PediaPress, so us researchers
> can focus on language technology (and occasionally push back to the mwlib
> tip).
>
> NLP researchers would mostly want to use mwlib as a parser, and then
> process/extract elements in the parse tree, or convert it to cleaned
> paragraphs of text.
>
> The fact that people want to use mwlib for things other than publishing
> books means that the API needs to be kept clean and fairly stable. It
> would be nice to get occasional changelogs so that we know when to update
> our working copies.
>
> Unfortunately, things in the API seem to be getting messier. In an old
> checkout of mwlib, I can do:
>
> >>> from mwlib import uparser
> >>> dir(uparser.simpleparse(''))
> Article 'unknown': 0 children
> ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
> '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
> '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
> '__setattr__', '__str__', '__weakref__', '_asText', 'allchildren',
> 'append', 'asText', 'caption', 'children', 'filter', 'find', 'hasContent',
> 'show']
>
> At the tip, I now get:
> ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__',
> '__getattribute__', '__hash__', '__init__', '__iter__', '__module__',
> '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
> '__setattr__', '__str__', '__weakref__', '_asText', '_get_text',
> '_set_text', '_text', 'align', 'allchildren', 'asText', 'blocknode',
> 'caption', 'children', 'filter', 'find', 'frame', 'interwiki',
> 'join_as_text', 'langlink', 'len', 'level', 'lineprefix', 'namespace',
> 'ns', 'rawtagname', 'show', 'source', 'start', 't_2box_close',
> 't_2box_open', 't_begin_table', 't_begintable', 't_break', 't_colon',
> 't_column', 't_comment', 't_complex_article', 't_complex_caption',
> 't_complex_compat', 't_complex_indent', 't_complex_line',
> 't_complex_link', 't_complex_named_url', 't_complex_node',
> 't_complex_preformatted', 't_complex_section', 't_complex_style',
> 't_complex_table', 't_complex_table_cell', 't_complex_table_row',
> 't_complex_tag', 't_end', 't_end_table', 't_endsection', 't_endtable',
> 't_entity', 't_hrule', 't_html_tag', 't_html_tag_end', 't_http_url',
> 't_item', 't_magicword', 't_newline', 't_pre', 't_row', 't_section',
> 't_section_end', 't_semicolon', 't_singlequote', 't_special',
> 't_tablecaption', 't_text', 't_uniq', 't_urllink', 't_vlist', 'tagname',
> 'target', 'text', 'thumb', 'token2name', 'type', 'vlist']
>
> Why does every parse tree node need all these attributes? Can we clean
> this up a little to make mwlib parse trees simpler to work with?
>

Well, these are mostly class attributes so it shouldn't matter. What's worse is that we have 3 different parse tree incarnations, which we'll probably merge (again changing the API). You'll probably have to wait for a 1.0 version if you want API stability, sorry.

- Ralf
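
To illustrate the class-attribute point for anyone reading the thread later: dir() reports names found on the class as well as on the instance, so a long listing does not by itself mean each node carries that much state. The sketch below uses a hypothetical Node class and two hypothetical helpers (instance_attrs, public_class_attrs); it is not mwlib's actual node implementation, only a small model of the behaviour being described.

```python
class Node(object):
    """Hypothetical stand-in for a parse tree node (not mwlib's class)."""

    # Class-level attributes: stored once on the class and shared by
    # every instance, yet still listed by dir() on each instance.
    t_text = 1
    t_newline = 2
    caption = ''

    def __init__(self, caption=''):
        # Instance-level attributes: stored per node in the instance __dict__.
        self.children = []
        if caption:
            self.caption = caption


def instance_attrs(obj):
    """Return the names stored on the object itself, sorted."""
    return sorted(vars(obj))


def public_class_attrs(obj):
    """Return public names that dir() finds on the class rather than the instance."""
    own = vars(obj)
    return sorted(name for name in dir(obj)
                  if not name.startswith('_') and name not in own)


node = Node()
print(instance_attrs(node))      # per-node state only
print(public_class_attrs(node))  # names inherited from the class
```

Under these assumptions, vars(node) is the cheaper way to see what a single node actually stores, while dir(node) shows everything reachable through the class.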
