I'm interested in exploring the wikitext-to-text approach. I'm just wondering how you would go about it without either a local MediaWiki or a remote Wikipedia call. Is there any code in the extraction framework that can be used to parse wiki markup? I tried toPlainText() on PageNode, but it appears only to replace links while keeping all wiki formatting such as bold, italics, lists, etc.
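For what it's worth, here is a rough sketch of the kind of extra stripping I mean on top of toPlainText(); the class and regexes below are my own illustration, not part of the framework:

```java
import java.util.regex.Pattern;

public class WikiMarkupStripper {

    // Order matters: bold-italic ('''''..''''') before bold (''') before italic ('').
    private static final Pattern BOLD_ITALIC = Pattern.compile("'''''(.+?)'''''");
    private static final Pattern BOLD = Pattern.compile("'''(.+?)'''");
    private static final Pattern ITALIC = Pattern.compile("''(.+?)''");
    // List bullets, numbering, and definition markers at the start of a line.
    private static final Pattern LIST_PREFIX = Pattern.compile("(?m)^[*#:;]+\\s*");

    public static String strip(String wikiText) {
        String s = BOLD_ITALIC.matcher(wikiText).replaceAll("$1");
        s = BOLD.matcher(s).replaceAll("$1");
        s = ITALIC.matcher(s).replaceAll("$1");
        s = LIST_PREFIX.matcher(s).replaceAll("");
        return s;
    }

    public static void main(String[] args) {
        // prints: Vasco Rossi è un cantautore italiano.
        System.out.println(strip("'''Vasco Rossi''' è un ''cantautore'' italiano."));
    }
}
```

This obviously won't handle nested templates or tables; it just covers the inline formatting that toPlainText() seems to leave behind.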

Regards,
Piotr

On 2012-10-04 08:16, Dimitris Kontokostas wrote:
I think you did exactly that, just with an unnecessary call to Wikipedia. The PageNode is a parameter to AbstractExtractor.extract, so you could call that directly.

The patched mw is here: http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction

I was thinking we have two future tasks regarding this:
1) Create an "abstract" mediawiki extension and get rid of the patched old mediawiki
2) See if a wikitext2text approach works (what you tried to do)

You could use the shortening function from the mw code and then maybe contribute your code back ;-)

Best,
Dimitris

PS. anyone else from the community that has some time to implement #1 is welcome


On Wed, Oct 3, 2012 at 9:58 PM, Piotr Jagielski <[email protected] <mailto:[email protected]>> wrote:

    I completely misunderstood what you were saying. I thought that
    you were asking me for feedback on abstract generation quality in
    general. Now I realize that you are referring to the fact that I
    generated abstracts without a local MediaWiki instance. What I
    did, however, may be different from what you suspect.

    Here's what I did:
    - I saw that you invoke the api.php of a local MediaWiki instance
    to parse wiki text. I didn't bother to set one up, so I just
    replaced the URL with the actual Wikipedia instance for the
    language I worked on. This caused the wiki text to be rendered
    with templates substituted.
    - After this modification I parsed the wiki text from the XML
    database dump using SimpleWikiParser and passed the PageNode to
    the getAbstractWikiText method in a modified AbstractExtractor.
    - I saw that the returned text contained HTML markup, so I removed
    it using an HTML sanitizer. I assumed that you use the "modified"
    MediaWiki to cover this part, but I wasn't sure.
    - I was not happy with the short method in AbstractExtractor
    because it didn't recognize sentence boundaries correctly, so I
    created my own shortening routine using java.text.BreakIterator
    with additional abbreviation checks.
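    In case it is useful, here is a minimal sketch of that shortening
    routine. The abbreviation list is illustrative only; a real one
    would need to be per-language:

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Set;

public class AbstractShortener {

    // Illustrative abbreviation list; a real implementation would load one per language.
    private static final Set<String> ABBREVIATIONS = Set.of("e.g.", "i.e.", "Dr.", "vs.");

    /** Returns the leading sentences of text, up to roughly maxLength characters. */
    public static String shorten(String text, int maxLength, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int end = 0;
        for (int next = it.next(); next != BreakIterator.DONE; next = it.next()) {
            String candidate = text.substring(0, next).trim();
            // Skip boundaries that fall right after a known abbreviation.
            if (endsWithAbbreviation(candidate)) continue;
            // Stop before adding a sentence that would exceed the limit,
            // but always keep at least the first sentence.
            if (next > maxLength && end > 0) break;
            end = next;
            if (end >= maxLength) break;
        }
        String result = (end == 0)
                ? text.substring(0, Math.min(maxLength, text.length()))
                : text.substring(0, end);
        return result.trim();
    }

    private static boolean endsWithAbbreviation(String s) {
        for (String abbr : ABBREVIATIONS) {
            if (s.endsWith(abbr)) return true;
        }
        return false;
    }
}
```

    Unlike a regex split, the BreakIterator respects locale-specific
    sentence rules, and the abbreviation check avoids cutting after
    tokens like "Dr.".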

    From what you're saying below, I suspect that you are interested
    in generating abstracts without needing to invoke MediaWiki at
    all, either locally or remotely. That I haven't tried to do.

    Sorry for the confusion but I'm very new to all this and I'm just
    trying to use some of the extraction framework code for my
    purposes. Are we on the same page now?

    Regards,
    Piotr


    On 2012-10-03 10:08, Pablo N. Mendes wrote:

    I have searched a bit through the list and only found an example
    in Italian.

    *Article:*
    http://it.wikipedia.org/wiki/Vasco_Rossi

    *Rendered text:*
    Vasco Rossi, anche noto come Vasco o con l'appellativo Il
    Blasco[7] (Zocca, 7 febbraio 1952), è un cantautore italiano.

    *Source:*
    {{Bio
    |Nome = Vasco
    |Cognome = Rossi
    |PostCognomeVirgola = anche noto come '''Vasco''' o con
    l'appellativo '''''Il Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556
    Ma Vasco Rossi torna a giugno. Il Blasco piace sempre]
    archivio.lastampa.it <http://archivio.lastampa.it></ref>
    |Sesso = M
    |LuogoNascita = Zocca
    |GiornoMeseNascita = 7 febbraio
    |AnnoNascita = 1952
    |LuogoMorte =
    |GiornoMeseMorte =
    |AnnoMorte =
    |Attività = cantautore
    |Nazionalità = italiano
    }}


    If you could compare the output for both solutions with a few
    such pages, we could have an initial assessment of "text quality"
    as Dimitris put it.

    Cheers,
    Pablo

    On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas
    <[email protected] <mailto:[email protected]>> wrote:

        I don't have a concrete test case; I would have to search
        blindly. What I was thinking is that if we could create the
        abstracts in exactly the same way as the modified mw, we
        could make a string comparison and test how many are
        different, and how. Depending on the number and frequency of
        the text-rendering templates that appear in the resulting
        abstracts, we could try to resolve them manually.
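        As a sketch of the comparison I mean (the map-based setup is
        just an illustration of the idea, not how the framework
        stores abstracts):

```java
import java.util.Map;

public class AbstractComparison {

    /**
     * Counts how many abstracts differ between the two generation methods.
     * An abstract missing from the second map counts as different.
     */
    public static long countDifferent(Map<String, String> fromMediaWiki,
                                      Map<String, String> fromWikitext) {
        return fromMediaWiki.entrySet().stream()
                .filter(e -> !e.getValue().equals(
                        fromWikitext.getOrDefault(e.getKey(), "")))
                .count();
    }
}
```

        Logging which resources differ, rather than just counting
        them, would also show us how they differ.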

        Removing the local Wikipedia mirror dependency from the
        extraction would be a huge plus, but we shouldn't compromise
        on quality.
        Any other ideas?

        Best,
        Dimitris


        On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes
        <[email protected] <mailto:[email protected]>> wrote:


            Perhaps it would help the discussion if we got more
            concrete. Dimitris, do you have a favorite abstract that
            is problematic (and therefore justifies using the
            modified MediaWiki)? Perhaps you could paste the wiki
            markup source and the desired outcome, and Piotr could
            respond with the rendering from his patch.

            On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas"
            <[email protected] <mailto:[email protected]>> wrote:
            >
            >
            > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski
            <[email protected] <mailto:[email protected]>> wrote:
            >>
            >> What do you mean by text quality? The text itself is
            as good as the first couple of sentences in the Wikipedia
            article you take it from, right?
            >
            >
            > Well, that is what I am asking :) Is it (exactly) the
            same text?
            > The problem is with some templates that render text
            (e.g. date templates). If we can measure the extent of
            their usage, we could see if this is the way to go.
            >
            > Best,
            > Dimitris
            >
            >>
            >>
            >> Piotr
            >>
            >>
            >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
            >>>
            >>> Our main interest is the text quality; if we get this
            right, the shortening/tweaking should be the easy part :)
            >>>
            >>> Could you please give us some text-quality feedback,
            and if it is good, maybe we can start testing it on other
            languages as well.
            >>>
            >>> Best,
            >>> Dimitris
            >>>
            >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski
            <[email protected] <mailto:[email protected]>> wrote:
            >>>>
            >>>> I haven't done extensive tests, but one thing to
            improve for sure is the abstract shortening algorithm.
            You currently use a simple regex to solve the complex
            problem of breaking natural language text into sentences.
            java.text.BreakIterator yields better results and is also
            locale-sensitive. You might also want to take a look at
            the more advanced boundary-analysis library described at
            http://userguide.icu-project.org/boundaryanalysis.
            >>>>
            >>>> Regards,
            >>>> Piotr




-- Kontokostas Dimitris




--
    Pablo N. Mendes
    http://pablomendes.com
    Events: http://wole2012.eurecom.fr <http://wole2012.eurecom.fr/>





--
Kontokostas Dimitris

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
