[mwlib] Re: mwlib for NLP, and cleaning up the API

Ralf Schmitt Thu, 13 Aug 2009 03:08:06 -0700

"Joel Nothman" <[email protected]> writes:

>
> Thanks for clarifying the nature of the parser change. There seem to be a  
> number of regressions in link parsing, some related to things I patched  
> last year (should I add tests?):
>
> (1) WAS:
>>>> uparser.simpleparse("[[Donkey]]")
>   Article 'unknown': 1 children
>       Paragraph '': 1 children
>           Link '': 1 children
>               'Donkey'
> (1) IS:
>>>> uparser.simpleparse("[[Donkey]]")
> Article
>      ArticleLink target=u"Donkey" ns=0
>
> I.e. no caption underneath link node which we added for consistency with  
> [[aaa|bbb]].


yes, I removed that one. It caused too many problems. I think someone
even complained that he can't distinguish between [[Donkey|Donkey]] and
[[Donkey]].
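
To illustrate the ambiguity: if the parser always adds a caption child,
both spellings yield the same tree. A minimal sketch of the alternative,
where the caption is derived on demand. The `Link` class here is a
hypothetical stand-in, not mwlib's actual node class:

```python
class Link:
    """Hypothetical stand-in for a parsed link node (not mwlib's class)."""

    def __init__(self, target, children=None):
        self.target = target
        # Only explicit captions ([[target|caption]]) are stored as children.
        self.children = list(children or [])


def has_explicit_caption(link):
    """True only if the wikitext spelled out a caption after the pipe."""
    return bool(link.children)


def display_caption(link):
    """Return the explicit caption, falling back to the target."""
    if link.children:
        return "".join(link.children)
    return link.target


bare = Link("Donkey")                # models [[Donkey]]
piped = Link("Donkey", ["Donkey"])   # models [[Donkey|Donkey]]
```

Both nodes display the same text, but `has_explicit_caption` can still
tell them apart.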

>
> (2) WAS:
>>>> uparser.simpleparse("[[''Donkey'']]")
>   Article 'unknown': 1 children
>       Paragraph '': 1 children
>           Link '': 1 children
>               Style "''": 1 children
>                   'Donkey'
> (2) IS:
>>>> uparser.simpleparse("[[''Donkey'']]")
> Article
>      ArticleLink target=u"''Donkey''" ns=0
>

Strictly speaking, [[''Donkey'']] should not even create a
link. MediaWiki outputs
[[<i>Donkey</i>]]
for that input, which seems totally stupid.


> (3) WAS:
>>>> print uparser.simpleparse("[[en:Donkey]]").children[0].children[0].target
>   Article 'unknown': 1 children
>       Paragraph '': 1 children
>           LangLink '': 1 children
>               'en:Donkey'
> Donkey
> (3) IS:
>>>> print uparser.simpleparse("[[en:Donkey]]").children[0].target
> Article
>      LangLink target=u'en:Donkey' interwiki='en' langlink='English'
> en:Donkey
>
> I.e. we had distinguished between a full_target and the stripped target,  
> which just had the title of the target page.

that also didn't work that great. Though I do not remember the exact reasons...
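
For callers who still want the stripped title, it can be recovered from
the full target outside the parser. A rough sketch, assuming the prefix
is everything before the first colon; `split_target` is a hypothetical
helper, not part of mwlib, and it does not check that the prefix is a
known language or namespace:

```python
def split_target(full_target):
    """Split 'en:Donkey' into ('en', 'Donkey').

    Targets without a colon are returned as ('', target).
    Assumption: the first colon separates the prefix from the title.
    """
    prefix, sep, title = full_target.partition(":")
    if not sep:
        return "", full_target
    return prefix, title
```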

>
> (4) WAS:
>>>> print uparser.simpleparse("[[Donkey]]s")
>   Article 'unknown': 1 children
>       Paragraph '': 1 children
>           ArticleLink '': 1 children
>               'Donkeys'
> (4) IS:
>>>> print uparser.simpleparse("[[Donkey]]s")
> Article
>      ArticleLink target=u'Donkey' ns=0
>      u's'
>
> This *is* considered in an existing test case, though that test is marked  
> to fail.
>

That was intentional. I cannot pull in that "s" character because of
(1).
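
If the displayed text "Donkeys" is needed, the trailing letters can be
merged after parsing. A minimal sketch, assuming the trail is a run of
lowercase ASCII letters (as on English-language wikis); the helper names
are hypothetical and not part of mwlib:

```python
import re

# Assumption: a link trail is one or more lowercase ASCII letters
# directly following the closing brackets.
TRAIL = re.compile(r"[a-z]+")


def split_trail(text_after_link):
    """Return (trail, rest) for the text that follows a link."""
    match = TRAIL.match(text_after_link)
    if match is None:
        return "", text_after_link
    return match.group(0), text_after_link[match.end():]


def caption_with_trail(target, text_after_link):
    """Return (display text, remaining text) for a caption-less link."""
    trail, rest = split_trail(text_after_link)
    return target + trail, rest
```

The link node itself stays untouched, so [[Donkey]]s and
[[Donkey|Donkeys]] remain distinguishable in the tree.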


>
> Are these regressions intentional? Or are they side-effects of some other  
> change? Should I reimplement the features?
>

I don't miss them....

> Also note that previously parsing "Some string" would automatically create  
> a paragraph node. Now in order to create a paragraph node, "\n\n" needs to  
> be present (and "\n\nText" erroneously creates two paragraph nodes).
>

that's a side effect of the new implementation. Paragraphs are only
created on \n\n or around block nodes. I'm not quite sure if I consider
the second case an error. 

mediawiki renders "<div>\n\nbla</div>" as:
<div>
<p>bla</p>
</div>

whereas it renders 
<div>bla</div>
as 
<div>bla</div>
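
The "\n\n" rule can be modelled roughly as follows. This is only a toy
model of the behaviour described above, not mwlib's implementation; it
ignores block nodes entirely, and `toy_paragraphs` is a made-up name:

```python
def toy_paragraphs(text):
    """Toy model: paragraph nodes appear only when '\\n\\n' is present.

    Returns the bare text in a list when there is no blank line, else one
    ('p', chunk) tuple per chunk between blank lines.
    """
    if "\n\n" not in text:
        return [text]  # no paragraph node is created
    return [("p", chunk) for chunk in text.split("\n\n")]
```

Under this model "\n\nText" yields an empty paragraph followed by one
containing "Text", which matches the two paragraph nodes reported.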



> It would seem that more of a test-driven development might be helpful.
>

we have around 600 tests, but could probably use more.

Regards,
- Ralf
