I completely misunderstood what you were asking. I thought you wanted general feedback on abstract generation quality. I now realize that you were referring to the fact that I generated abstracts without a local MediaWiki instance. What I did, however, may differ from what you suspect.

Here's what I did:
- I saw that you invoke api.php of a local MediaWiki instance to parse wiki text. I didn't bother to set one up, so I just replaced the URL with that of the actual Wikipedia instance for the language I worked on. This caused the wiki text to be rendered with templates substituted.
- After this modification, I parsed wiki text from the XML database dump using SimpleWikiParser and passed the PageNode to the getAbstractWikiText method in the modified AbstractExtractor.
- I saw that the returned text contained HTML markup, so I removed it using an HTML sanitizer. I assumed that you use the "modified" MediaWiki to cover this part, but I wasn't sure.
- I was not happy with the short method in AbstractExtractor because it didn't recognize sentence boundaries correctly. I created my own shortening routine using java.text.BreakIterator with additional abbreviation checks.
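
To make the last step concrete, here is a minimal sketch of sentence-aware shortening with java.text.BreakIterator plus an abbreviation check. It is only an illustration of the idea, not the routine described above: the class name AbstractShortener, the abbreviation list, and the maxLength parameter are all assumptions and are not part of the extraction framework.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the DBpedia extraction framework.
class AbstractShortener {

    // Assumed, illustrative abbreviation list; a real one would be per language.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "prof.", "e.g.", "i.e.", "np.", "tzw."));

    /**
     * Returns the longest prefix of whole sentences that fits in maxLength
     * characters. The first sentence is always kept, even if it is longer.
     */
    static String shorten(String text, int maxLength, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int end = 0;
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            String candidate = text.substring(0, b).trim();
            // A boundary right after a known abbreviation is treated as a
            // false positive, unless it is the end of the text.
            if (b != text.length() && endsWithAbbreviation(candidate)) {
                continue;
            }
            // Stop once adding another sentence would exceed the limit.
            if (end > 0 && candidate.length() > maxLength) {
                break;
            }
            end = b;
        }
        return text.substring(0, end).trim();
    }

    // Checks whether the last whitespace-delimited token is a listed abbreviation.
    private static boolean endsWithAbbreviation(String s) {
        String lastToken = s.substring(s.lastIndexOf(' ') + 1).toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(lastToken);
    }
}
```

For example, AbstractShortener.shorten("Dr. Smith arrived. He left.", 20, Locale.ENGLISH) keeps only the first full sentence, and does not cut after "Dr." even if the iterator reports a boundary there. Always keeping the first sentence is a design choice of this sketch, so that an abstract is never empty.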

From what you're saying below, I suspect that you are interested in generating abstracts without invoking MediaWiki at all, either locally or remotely. That I haven't tried to do.

Sorry for the confusion, but I'm very new to all this and I'm just trying to use some of the extraction framework code for my own purposes. Are we on the same page now?

Regards,
Piotr

On 2012-10-03 10:08, Pablo N. Mendes wrote:

I have searched a bit through the list and only found an example in Italian.

*Article:*
http://it.wikipedia.org/wiki/Vasco_Rossi

*Rendered text:*
Vasco Rossi, anche noto come Vasco o con l'appellativo Il Blasco[7] (Zocca, 7 febbraio 1952), è un cantautore italiano.

*Source:*
{{Bio
|Nome = Vasco
|Cognome = Rossi
|PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo '''''Il Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556 Ma Vasco Rossi torna a giugno. Il Blasco piace sempre] archivio.lastampa.it</ref>
|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|LuogoMorte =
|GiornoMeseMorte =
|AnnoMorte =
|Attività = cantautore
|Nazionalità = italiano
}}


If you could compare the output for both solutions with a few such pages, we could have an initial assessment of "text quality" as Dimitris put it.

Cheers,
Pablo

On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas <[email protected] <mailto:[email protected]>> wrote:

    I don't have a concrete test case; I would have to search blind.
    What I was thinking is that if we could create the abstracts in
    exactly the same way as the modified MW, we could do a string
    comparison and test how many are different and how. Depending on
    the number and frequency of the text-rendering templates that
    appear in the resulting abstracts, we could try to resolve them manually.

    Removing the local Wikipedia mirror dependency for the extraction
    could be a huge plus but we shouldn't compromise on quality.
    Any other ideas?

    Best,
    Dimitris


    On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes
    <[email protected] <mailto:[email protected]>> wrote:


        Perhaps it would help the discussion if we got more concrete.
        Dimitris, do you have a favorite abstract that is problematic
        (therefore justifies using the modified MediaWiki)? Perhaps
        you can paste the wiki markup source and the desired outcome
        and Piotr can respond with the rendering by his patch.

        On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas"
        <[email protected] <mailto:[email protected]>> wrote:
        >
        >
        > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski
        <[email protected] <mailto:[email protected]>> wrote:
        >>
        >> What do you mean by text quality? The text itself is as
        >> good as the first couple of sentences in the Wikipedia article
        >> you take it from, right?
        >
        >
        > Well, that is what I am asking :) Is it (exactly) the same text?
        > The problem is with some templates that render text (e.g.
        > date templates). If we can measure the extent of their usage,
        > we could see if this is the way to go.
        >
        > Best,
        > Dimitris
        >
        >>
        >>
        >> Piotr
        >>
        >>
        >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
        >>>
        >>> Our main interest is the text quality; if we get this
        >>> right, the shortening / tweaking should be the easy part :)
        >>>
        >>> Could you please give us some text quality feedback?
        >>> If it is good, maybe we can start testing it on other
        >>> languages as well.
        >>>
        >>> Best,
        >>> Dimitris
        >>>
        >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski
        <[email protected] <mailto:[email protected]>> wrote:
        >>>>
        >>>> I haven't done extensive tests, but one thing to improve
        >>>> for sure is the abstract shortening algorithm. You currently
        >>>> use a simple regex to solve the complex problem of breaking
        >>>> natural language text into sentences. java.text.BreakIterator
        >>>> yields better results and is also locale sensitive. You might
        >>>> also want to take a look at the more advanced boundary analysis
        >>>> library at http://userguide.icu-project.org/boundaryanalysis.
        >>>>
        >>>> Regards,
        >>>> Piotr




-- Kontokostas Dimitris




--
---
Pablo N. Mendes
http://pablomendes.com
Events: http://wole2012.eurecom.fr <http://wole2012.eurecom.fr/>


_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
