I think you did exactly that, with an unnecessary call to Wikipedia. The
PageNode is a parameter to AbstractExtractor.extract, so you could call
that directly.
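
For illustration, here is a minimal sketch of the direct call I mean. It
assumes the 2012-era framework classes and an extract signature of
(PageNode, String, PageContext); the package and parameter names may differ
in your checkout:

import org.dbpedia.extraction.wikiparser.{SimpleWikiParser, WikiPage}
import org.dbpedia.extraction.mappings.{AbstractExtractor, PageContext}

// Parse the dump page once and hand the resulting PageNode straight to
// the extractor, instead of round-tripping through a remote api.php.
def extractAbstract(extractor: AbstractExtractor,
                    wikiPage: WikiPage,
                    subjectUri: String,
                    context: PageContext) = {
  val pageNode = new SimpleWikiParser()(wikiPage) // SimpleWikiParser is a WikiParser
  extractor.extract(pageNode, subjectUri, context)
}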

The patched MediaWiki is here:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction

I was thinking we have two future tasks regarding this:
1) Create an "abstract" MediaWiki extension and get rid of the old patched
MediaWiki.
2) See if a wikitext2text approach works (what you tried to do).

You could use the shortening function from the MediaWiki code and then
maybe contribute your code back ;-)

Best,
Dimitris

PS: Anyone else from the community who has some time to implement #1 is
welcome.


On Wed, Oct 3, 2012 at 9:58 PM, Piotr Jagielski <[email protected]> wrote:

>  I completely misunderstood what you were saying. I thought that you were
> asking me for abstract-generation quality feedback in general. Now I
> realize that you are referring to the fact that I generated abstracts
> without a local MediaWiki instance. What I did, however, may be different
> from what you suspect.
>
> Here's what I did:
> - I saw that you invoke the api.php of a local MediaWiki instance to parse
> wiki text. I didn't bother to set one up, so I just replaced the URL with
> the actual Wikipedia instance of the language I worked on. This caused the
> wiki text to be rendered with templates substituted.
> - After this modification I parsed the wiki text from the XML database dump
> using SimpleWikiParser and passed the PageNode to the getAbstractWikiText
> method in the modified AbstractExtractor.
> - I saw that the returned text contained HTML markup, so I removed it using
> an HTML sanitizer. I assumed that you use the "modified" MediaWiki to cover
> this part, but I wasn't sure.
> - I was not happy with the short method in AbstractExtractor because it
> didn't recognize sentence boundaries correctly. I created my own shortening
> routine using java.text.BreakIterator with additional abbreviation checks
> (sketched just after this list).
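>
> In case it helps, here is a minimal sketch of those last two steps. The
> sanitize line assumes jsoup on the classpath (any HTML sanitizer would
> do), and the abbreviation list is illustrative only:
>
> import java.text.BreakIterator
> import java.util.Locale
> import org.jsoup.Jsoup
>
> // Strip the HTML markup returned by api.php down to plain text.
> def sanitize(html: String): String = Jsoup.parse(html).text()
>
> // Locale-aware shortening: accumulate whole sentences until minLength
> // is reached, and skip boundaries that fall right after a known
> // abbreviation (where BreakIterator tends to split falsely).
> val abbreviations = Set("e.g.", "i.e.", "Dr.", "St.")
>
> def shortenAbstract(text: String, minLength: Int, locale: Locale): String = {
>   val it = BreakIterator.getSentenceInstance(locale)
>   it.setText(text)
>   var end = it.next()
>   while (end != BreakIterator.DONE &&
>          (end < minLength ||
>           abbreviations.exists(a => text.substring(0, end).trim.endsWith(a))))
>     end = it.next()
>   if (end == BreakIterator.DONE) text else text.substring(0, end).trim
> }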
>
> From what you're saying below, I suspect that you are interested in
> generating abstracts without the need to invoke MediaWiki either locally or
> remotely. That I haven't tried to do.
>
> Sorry for the confusion, but I'm very new to all this and I'm just trying
> to use some of the extraction framework code for my own purposes. Are we on
> the same page now?
>
> Regards,
> Piotr
>
>
> On 2012-10-03 10:08, Pablo N. Mendes wrote:
>
>
>  I have searched a bit through the list and only found an example in
> Italian.
>
>  *Article:*
> http://it.wikipedia.org/wiki/Vasco_Rossi
>
>  *Rendered text:*
> Vasco Rossi, anche noto come Vasco o con l'appellativo Il Blasco[7]
> (Zocca, 7 febbraio 1952), è un cantautore italiano.
>
>  *Source:*
>  {{Bio
> |Nome = Vasco
> |Cognome = Rossi
> |PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo
> '''''Il Blasco'''''<ref>[
> http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556
> Ma Vasco Rossi torna a giugno. Il Blasco piace sempre]
> archivio.lastampa.it</ref>
> |Sesso = M
> |LuogoNascita = Zocca
> |GiornoMeseNascita = 7 febbraio
> |AnnoNascita = 1952
> |LuogoMorte =
> |GiornoMeseMorte =
> |AnnoMorte =
> |Attività = cantautore
> |Nazionalità = italiano
> }}
>
>
>  If you could compare the output for both solutions with a few such
> pages, we could have an initial assessment of "text quality" as Dimitris
> put it.
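>
>  For such a spot check, the remote rendering Piotr describes can be
> reproduced with a plain api.php call (action=parse is the standard
> MediaWiki API; error handling is omitted here, and long pages should
> really go in a POST body). A rough sketch:
>
> import java.net.{URL, URLEncoder}
> import scala.io.Source
>
> // Render wiki source through a live Wikipedia's api.php, as in Piotr's
> // modification. Returns the raw XML response; extracting the rendered
> // HTML from it is left to the caller.
> def renderRemotely(wikiText: String, lang: String): String = {
>   val query = "action=parse&format=xml&prop=text&text=" +
>               URLEncoder.encode(wikiText, "UTF-8")
>   val url = new URL(s"http://$lang.wikipedia.org/w/api.php?$query")
>   Source.fromInputStream(url.openStream(), "UTF-8").mkString
> }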
>
>  Cheers,
> Pablo
>
> On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas <[email protected]> wrote:
>
>> I don't have a concrete test case; I would have to search blind.
>> What I was thinking is that if we could create the abstracts in exactly
>> the same way as the modified MediaWiki, we could make a string comparison
>> and test how many are different, and how (a rough sketch follows below).
>> Depending on the number and frequency of the text-rendering templates that
>> appear in the resulting abstracts, we could try to resolve them manually.
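>>
>> Something like this would do for a first pass (the file layout is an
>> assumption: one abstract per line, both files in the same page order):
>>
>> import scala.io.Source
>>
>> // Count how many abstracts differ between the modified-MediaWiki output
>> // and the wikitext2text output.
>> def countDiffering(fileA: String, fileB: String): Int = {
>>   val a = Source.fromFile(fileA, "UTF-8").getLines().toSeq
>>   val b = Source.fromFile(fileB, "UTF-8").getLines().toSeq
>>   (a zip b).count { case (x, y) => x != y }
>> }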
>>
>> Removing the local Wikipedia mirror dependency for the extraction could
>> be a huge plus, but we shouldn't compromise on quality.
>> Any other ideas?
>>
>> Best,
>> Dimitris
>>
>>
>> On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes <[email protected]> wrote:
>>
>>>
>>> Perhaps it would help the discussion if we got more concrete. Dimitris,
>>> do you have a favorite abstract that is problematic (and therefore
>>> justifies using the modified MediaWiki)? Perhaps you can paste the wiki
>>> markup source and the desired outcome, and Piotr can respond with the
>>> rendering produced by his patch.
>>>
>>> On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas" <[email protected]>
>>> wrote:
>>> >
>>> >
>>> > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski <
>>> [email protected]> wrote:
>>> >>
>>> >> What do you mean by text quality? The text itself is as good as the
>>> first couple of sentences in the Wikipedia article you take it from, right?
>>> >
>>> >
>>> > Well, that is what I am asking :) Is it (exactly) the same text?
>>> > The problem is with some templates that render text (e.g. date
>>> templates). If we can measure the extent of their usage, we could see if
>>> this is the way to go.
>>> >
>>> > Best,
>>> > Dimitris
>>> >
>>> >>
>>> >>
>>> >> Piotr
>>> >>
>>> >>
>>> >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
>>> >>>
>>> >>> Our main interest is the text quality; if we get this right, the
>>> shortening/tweaking should be the easy part :)
>>> >>>
>>> >>> Could you please give us some text-quality feedback, and if it is
>>> good, maybe we can start testing it on other languages as well.
>>> >>>
>>> >>> Best,
>>> >>> Dimitris
>>> >>>
>>> >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski <
>>> [email protected]> wrote:
>>> >>>>
>>> >>>> I haven't done extensive tests, but one thing to improve for sure is
>>> the abstract shortening algorithm. You currently use a simple regex to
>>> solve the complex problem of breaking natural-language text into
>>> sentences. java.text.BreakIterator yields better results and is also
>>> locale-sensitive. You might also want to take a look at ICU's more
>>> advanced boundary analysis library at
>>> http://userguide.icu-project.org/boundaryanalysis.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Piotr
>>>
>>
>>
>>
>>  --
>> Kontokostas Dimitris
>>
>
>
>
>  --
> ---
> Pablo N. Mendes
> http://pablomendes.com
> Events: http://wole2012.eurecom.fr
>
>
>


-- 
Kontokostas Dimitris