Re: [Dbpedia-discussion] template parsing bug

Piotr Jagielski Wed, 03 Oct 2012 11:59:27 -0700

I completely misunderstood what you were saying. I thought that you asked me for abstract generation quality feedback in general. Now I realize that you are referring to the fact that I generated abstracts without a local MediaWiki instance. What I did, however, may be different from what you suspect.


Here's what I did:

- I saw that you invoke api.php of a local MediaWiki instance to parse wiki text. I didn't bother to set it up, so I just replaced the URL with the actual Wikipedia instance of the language I worked on. This caused the wiki text to be rendered with templates substituted.
- After this modification I parsed wiki text from the XML database dump using SimpleWikiParser and passed the PageNode to the getAbstractWikiText method in a modified AbstractExtractor.
- I saw that the returned text contains HTML markup, so I removed it using an HTML sanitizer. I assumed that you use the "modified" MediaWiki to cover this part, but I wasn't sure.
- I was not happy with the short method in AbstractExtractor because it didn't recognize sentence boundaries correctly. I created my own shortening routine using java.text.BreakIterator with additional abbreviation checks.
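To make the last step concrete, here is a minimal sketch of sentence-based shortening with java.text.BreakIterator plus an abbreviation check. It is not the code from my patch or from the extraction framework: the class name AbstractShortener, the shorten method, and the tiny abbreviation list are all assumptions made up for illustration, and a real abbreviation list would have to be per-language.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
class AbstractShortener {
    // Assumed, deliberately tiny abbreviation list (lower-case, with trailing dot).
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "prof.", "e.g.", "i.e.", "etc."));

    /** Returns the longest run of whole leading sentences that fits in maxLength characters. */
    static String shorten(String text, int maxLength, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int end = 0;
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            String candidate = text.substring(0, b).trim();
            // Ignore a boundary that directly follows a known abbreviation,
            // unless it is the end of the text.
            if (b != text.length() && endsWithAbbreviation(candidate)) {
                continue;
            }
            if (candidate.length() > maxLength) {
                break;
            }
            end = b;
        }
        if (end > 0) {
            return text.substring(0, end).trim();
        }
        // No whole sentence fits: fall back to a hard cut.
        return text.substring(0, Math.min(maxLength, text.length())).trim();
    }

    private static boolean endsWithAbbreviation(String s) {
        String lastToken = s.substring(s.lastIndexOf(' ') + 1).toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(lastToken);
    }
}
```

Calling AbstractShortener.shorten(text, 500, Locale.ITALIAN) would apply the iterator's Italian rules where the JDK provides them. The abbreviation check is needed because the iterator can report a boundary after a dot that is followed by a capitalized word.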

From what you're saying below, I suspect that you are interested in generating abstracts without needing to invoke MediaWiki either locally or remotely. That I haven't tried to do.

Sorry for the confusion, but I'm very new to all this and I'm just trying to use some of the extraction framework code for my purposes. Are we on the same page now?


Regards,
Piotr

On 2012-10-03 10:08, Pablo N. Mendes wrote:

I have searched a bit through the list and only found an example in Italian.


*Article:*
http://it.wikipedia.org/wiki/Vasco_Rossi

*Rendered text:*

Vasco Rossi, anche noto come Vasco o con l'appellativo Il Blasco[7] (Zocca, 7 febbraio 1952), è un cantautore italiano.


*Source:*
{{Bio
|Nome = Vasco
|Cognome = Rossi

|PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo '''''Il Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556 Ma Vasco Rossi torna a giugno. Il Blasco piace sempre] archivio.lastampa.it</ref>

|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|LuogoMorte =
|GiornoMeseMorte =
|AnnoMorte =
|Attività = cantautore
|Nazionalità = italiano
}}

If you could compare the output of both solutions on a few such pages, we could have an initial assessment of "text quality", as Dimitris put it.


Cheers,
Pablo

On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas <[email protected] <mailto:[email protected]>> wrote:


    I don't have a concrete test case; I would have to search blindly.
    What I was thinking is that if we could create the abstracts in
    exactly the same way as the modified MW, we could do a string
    comparison and test how many are different and how. Depending on
    the number and frequency of the text-rendering templates that
    show up in the differing abstracts, we could try to resolve them manually.

    Removing the local Wikipedia mirror dependency for the extraction
    could be a huge plus but we shouldn't compromise on quality.
    Any other ideas?

    Best,
    Dimitris


    On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes
    <[email protected] <mailto:[email protected]>> wrote:


        Perhaps it would help the discussion if we got more concrete.
        Dimitris, do you have a favorite abstract that is problematic
        (therefore justifies using the modified MediaWiki)? Perhaps
        you can paste the wiki markup source and the desired outcome
        and Piotr can respond with the rendering by his patch.

        On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas"
        <[email protected] <mailto:[email protected]>> wrote:
        >
        >
        > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski
        <[email protected] <mailto:[email protected]>> wrote:
        >>
        >> What do you mean by text quality? The text itself is as
        good as the first couple of sentences in the Wikipedia article
        you take it from, right?
        >
        >
        > Well, that is what I am asking :) Is it (exactly) the same text?
        > The problem is with some templates that render text (e.g.
        date templates). If we can measure the extent of their usage, we could
        see if this is the way to go.
        >
        > Best,
        > Dimitris
        >
        >>
        >>
        >> Piotr
        >>
        >>
        >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
        >>>
        >>> Our main interest is the text quality, if we get this
        right the shortening / tweaking should be the easy part :)
        >>>
        >>> Could you please give us some text quality feedback?
        If it is good, maybe we can start testing it on other
        languages as well.
        >>>
        >>> Best,
        >>> Dimitris
        >>>
        >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski
        <[email protected] <mailto:[email protected]>> wrote:
        >>>>
        >>>> I haven't done extensive tests but one thing to improve
        for sure is the abstract shortening algorithm. You currently
        use a simple regex to solve a complex problem of breaking down
        natural language text into sentences. java.text.BreakIterator
        yields better results and is also locale-sensitive. You might
        also want to take a look at the more advanced boundary analysis
        library at http://userguide.icu-project.org/boundaryanalysis.
        >>>>
        >>>> Regards,
        >>>> Piotr

--Kontokostas Dimitris





--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr


_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
