+1 for a standalone WikiText to Text function.

For example, being able to generate a plain text version of a page without a
call to a local MediaWiki instance (or to Wikipedia itself) drastically
simplifies porting the DBpedia extraction framework (aka DEF) onto a
distributed platform such as Hadoop, because MediaWiki instances don't have to
be shipped and set up on each node for each run.

It also makes DEF lighter and easier to install on a single box, whatever its
size.


Nicolas.



On Oct 4, 2012, at 2:12 PM, Piotr Jagielski <[email protected]> wrote:

> I'm interested in exploring the wikitext-to-text approach. I'm just wondering
> how you plan to do it without either a local MediaWiki or a remote Wikipedia
> call. Do you have any code in the extraction framework that can be used to
> parse wiki markup? I tried toPlainText() on PageNode, but it appears to only
> replace links while keeping all wiki formatting such as bold, italics,
> lists, etc.
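> 
> For illustration, the kind of post-processing that seems to be missing would
> look roughly like this (a sketch; the class name and regexes are mine, not
> framework code):
> 
> public class WikiMarkupStripper {
>     // Strip residual wiki formatting from toPlainText() output.
>     // Assumes links are already resolved; only inline markup remains.
>     public static String strip(String text) {
>         return text
>             .replaceAll("'''''", "")                          // bold italics
>             .replaceAll("'''", "")                            // bold
>             .replaceAll("''", "")                             // italics
>             .replaceAll("(?m)^[*#:;]+\\s*", "")               // list/indent markers
>             .replaceAll("(?m)^=+\\s*(.*?)\\s*=+\\s*$", "$1"); // headings
>     }
> }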
> 
> Regards,
> Piotr
> 
> On 2012-10-04 08:16, Dimitris Kontokostas wrote:
>> I think you did exactly that, with an unnecessary call to Wikipedia. The 
>> PageNode is a parameter to AbstractExtractor.extract, so you could call 
>> that directly.
>> 
>> The patched mw is here: 
>> http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction
>> 
>> I think we have two future tasks regarding this:
>> 1) Create an "abstract" MediaWiki extension and get rid of the patched old 
>> MediaWiki.
>> 2) See if a wikitext2text approach works (what you tried to do).
>> 
>> You could use the shortening function from the mw code and then maybe 
>> contribute your code back ;-)
>> 
>> Best,
>> Dimitris
>> 
>> PS: Anyone else from the community who has some time to implement #1 is 
>> welcome.
>> 
>> 
>> On Wed, Oct 3, 2012 at 9:58 PM, Piotr Jagielski <[email protected]> 
>> wrote:
>>> I completely misunderstood what you were saying. I thought that you asked 
>>> me for abstract generation quality feedback in general. Now I realize that 
>>> you are referring to the fact that I generated abstracts without a local 
>>> MediaWiki instance. What I did, however, may be different from what you 
>>> suspect.
>>> 
>>> Here's what I did:
>>> - I saw that you invoke the api.php of a local MediaWiki instance to parse 
>>> wiki text. I didn't bother to set one up, so I just replaced the URL with 
>>> the actual Wikipedia instance of the language I worked on. This caused the 
>>> wiki text to be rendered with templates substituted.
>>> - After this modification I parsed the wiki text from the XML database dump 
>>> using SimpleWikiParser and passed the PageNode to the getAbstractWikiText 
>>> method in the modified AbstractExtractor.
>>> - I saw that the returned text contained HTML markup, so I removed it using 
>>> an HTML sanitizer. I assumed that you use the "modified" MediaWiki to cover 
>>> this part, but I wasn't sure.
>>> - I was not happy with the short method in AbstractExtractor because it 
>>> didn't recognize sentence boundaries correctly. I created my own shortening 
>>> routine using java.text.BreakIterator with additional abbreviation checks 
>>> (see the sketch after this list).
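>>> 
>>> In case it helps, here is roughly what those last two steps look like (a
>>> simplified sketch: I use Jsoup for the HTML stripping, and the abbreviation
>>> list shown is illustrative; the real one would be per-language):
>>> 
>>> import java.text.BreakIterator;
>>> import java.util.Arrays;
>>> import java.util.HashSet;
>>> import java.util.Locale;
>>> import java.util.Set;
>>> import org.jsoup.Jsoup;
>>> 
>>> public class AbstractShortener {
>>>     // Illustrative abbreviation list; the real one is per-language.
>>>     private static final Set<String> ABBREVIATIONS =
>>>         new HashSet<String>(Arrays.asList("e.g.", "i.e.", "Mr.", "Dr."));
>>> 
>>>     public static String shorten(String html, Locale locale, int minLength) {
>>>         String text = Jsoup.parse(html).text(); // drop the HTML markup
>>>         BreakIterator it = BreakIterator.getSentenceInstance(locale);
>>>         it.setText(text);
>>>         int end = it.first();
>>>         // Take whole sentences until minLength is reached, but don't
>>>         // accept a boundary that falls right after an abbreviation.
>>>         for (int next = it.next(); next != BreakIterator.DONE; next = it.next()) {
>>>             String candidate = text.substring(0, next).trim();
>>>             if (endsWithAbbreviation(candidate)) continue; // false boundary
>>>             end = next;
>>>             if (candidate.length() >= minLength) break;
>>>         }
>>>         return text.substring(0, end).trim();
>>>     }
>>> 
>>>     private static boolean endsWithAbbreviation(String s) {
>>>         for (String abbr : ABBREVIATIONS) {
>>>             if (s.endsWith(abbr)) return true;
>>>         }
>>>         return false;
>>>     }
>>> }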
>>> 
>>> From what you're saying below, I suspect that you are interested in 
>>> generating abstracts without the need to invoke MediaWiki either locally or 
>>> remotely. That I haven't tried to do.
>>> 
>>> Sorry for the confusion, but I'm very new to all this and I'm just trying 
>>> to use some of the extraction framework code for my own purposes. Are we on 
>>> the same page now?
>>> 
>>> Regards,
>>> Piotr
>>> 
>>> 
>>> On 2012-10-03 10:08, Pablo N. Mendes wrote:
>>>> 
>>>> I have searched a bit through the list and only found an example in 
>>>> Italian.
>>>> 
>>>> Article:
>>>> http://it.wikipedia.org/wiki/Vasco_Rossi
>>>> 
>>>> Rendered text:
>>>> Vasco Rossi, anche noto come Vasco o con l'appellativo Il Blasco[7] 
>>>> (Zocca, 7 febbraio 1952), è un cantautore italiano.
>>>> (English: Vasco Rossi, also known as Vasco or by the nickname Il 
>>>> Blasco[7] (Zocca, 7 February 1952), is an Italian singer-songwriter.)
>>>> 
>>>> Source:
>>>> {{Bio
>>>> |Nome = Vasco
>>>> |Cognome = Rossi
>>>> |PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo 
>>>> '''''Il 
>>>> Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556
>>>>  Ma Vasco Rossi torna a giugno. Il Blasco piace sempre] 
>>>> archivio.lastampa.it</ref>
>>>> |Sesso = M
>>>> |LuogoNascita = Zocca
>>>> |GiornoMeseNascita = 7 febbraio
>>>> |AnnoNascita = 1952
>>>> |LuogoMorte =
>>>> |GiornoMeseMorte = 
>>>> |AnnoMorte = 
>>>> |Attività = cantautore
>>>> |Nazionalità = italiano
>>>> }}
>>>> 
>>>> 
>>>> If you could compare the output of both solutions on a few such pages, 
>>>> we could have an initial assessment of "text quality", as Dimitris put it.
>>>> 
>>>> Cheers,
>>>> Pablo
>>>> 
>>>> On Wed, Oct 3, 2012 at 9:30 AM, Dimitris Kontokostas <[email protected]> 
>>>> wrote:
>>>>> I don't have a concrete test case; I would have to search blindly. 
>>>>> What I was thinking is that if we could create the abstracts in exactly 
>>>>> the same way as the modified mw, we could make a string comparison and 
>>>>> test how many are different, and how. Depending on the number and 
>>>>> frequency of the text-rendering templates that appear in the resulting 
>>>>> abstracts, we could try to resolve them manually.
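>>>>> 
>>>>> Something like the following would do for a first pass (a sketch; it
>>>>> assumes both pipelines have been dumped into title -> abstract maps
>>>>> beforehand):
>>>>> 
>>>>> import java.util.Map;
>>>>> 
>>>>> public class AbstractComparison {
>>>>>     // Compare abstracts from the modified-MediaWiki pipeline against
>>>>>     // the wikitext2text pipeline and report how many differ.
>>>>>     public static void compare(Map<String, String> fromMediaWiki,
>>>>>                                Map<String, String> fromWikitext2text) {
>>>>>         int same = 0, different = 0;
>>>>>         for (Map.Entry<String, String> e : fromMediaWiki.entrySet()) {
>>>>>             String other = fromWikitext2text.get(e.getKey());
>>>>>             if (other == null) continue; // page missing from one output
>>>>>             if (e.getValue().equals(other)) {
>>>>>                 same++;
>>>>>             } else {
>>>>>                 different++;
>>>>>                 // Print a few differing titles so we can inspect which
>>>>>                 // templates cause the mismatches.
>>>>>                 if (different <= 10) System.out.println(e.getKey());
>>>>>             }
>>>>>         }
>>>>>         System.out.printf("same: %d, different: %d%n", same, different);
>>>>>     }
>>>>> }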
>>>>> 
>>>>> Removing the local Wikipedia mirror dependency from the extraction would 
>>>>> be a huge plus, but we shouldn't compromise on quality.
>>>>> Any other ideas?
>>>>> 
>>>>> Best,
>>>>> Dimitris
>>>>> 
>>>>> 
>>>>> On Wed, Oct 3, 2012 at 9:41 AM, Pablo N. Mendes <[email protected]> 
>>>>> wrote:
>>>>>> 
>>>>>> Perhaps it would help the discussion if we got more concrete. Dimitris, 
>>>>>> do you have a favorite problematic abstract (one that justifies using 
>>>>>> the modified MediaWiki)? Perhaps you can paste the wiki markup source 
>>>>>> and the desired outcome, and Piotr can respond with the rendering 
>>>>>> produced by his patch.
>>>>>> 
>>>>>> On Oct 3, 2012 8:31 AM, "Dimitris Kontokostas" <[email protected]> wrote:
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Oct 3, 2012 at 12:42 AM, Piotr Jagielski 
>>>>>> > <[email protected]> wrote:
>>>>>> >>
>>>>>> >> What do you mean by text quality? The text itself is as good as the 
>>>>>> >> first couple of sentences in the Wikipedia article you take it from, 
>>>>>> >> right?
>>>>>> >
>>>>>> >
>>>>>> > Well, that is what I am asking :) Is it (exactly) the same text?
>>>>>> > The problem is with some templates that render text (e.g. date 
>>>>>> > templates). If we can measure the extent of their usage, we could see 
>>>>>> > if this is the way to go.
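>>>>>> >
>>>>>> > Measuring it could be as simple as tallying template occurrences in
>>>>>> > the abstract wikitext, e.g. with a sketch like this (the input is
>>>>>> > assumed to be the wikitext of each page's first section):
>>>>>> >
>>>>>> > import java.util.HashMap;
>>>>>> > import java.util.Map;
>>>>>> > import java.util.regex.Matcher;
>>>>>> > import java.util.regex.Pattern;
>>>>>> >
>>>>>> > public class TemplateCounter {
>>>>>> >     // Tally which templates occur in abstract wikitext, so the most
>>>>>> >     // frequent text-rendering ones (dates etc.) can be handled first.
>>>>>> >     public static Map<String, Integer> count(Iterable<String> abstracts) {
>>>>>> >         Pattern template = Pattern.compile("\\{\\{\\s*([^|}]+)");
>>>>>> >         Map<String, Integer> counts = new HashMap<String, Integer>();
>>>>>> >         for (String wikitext : abstracts) {
>>>>>> >             Matcher m = template.matcher(wikitext);
>>>>>> >             while (m.find()) {
>>>>>> >                 String name = m.group(1).trim();
>>>>>> >                 Integer c = counts.get(name);
>>>>>> >                 counts.put(name, c == null ? 1 : c + 1);
>>>>>> >             }
>>>>>> >         }
>>>>>> >         return counts;
>>>>>> >     }
>>>>>> > }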
>>>>>> >
>>>>>> > Best,
>>>>>> > Dimitris
>>>>>> >  
>>>>>> >>
>>>>>> >>
>>>>>> >> Piotr
>>>>>> >>
>>>>>> >>
>>>>>> >> On 2012-10-02 22:49, Dimitris Kontokostas wrote:
>>>>>> >>>
>>>>>> >>> Our main interest is the text quality; if we get this right, the 
>>>>>> >>> shortening / tweaking should be the easy part :)
>>>>>> >>>
>>>>>> >>> Could you please give us some text quality feedback? If it is good, 
>>>>>> >>> maybe we can start testing it on other languages as well.
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>> Dimitris
>>>>>> >>>
>>>>>> >>> On Tue, Oct 2, 2012 at 11:11 PM, Piotr Jagielski 
>>>>>> >>> <[email protected]> wrote:
>>>>>> >>>>
>>>>>> >>>> I haven't done extensive tests, but one thing to improve for sure 
>>>>>> >>>> is the abstract shortening algorithm. You currently use a simple 
>>>>>> >>>> regex to solve the complex problem of breaking natural language 
>>>>>> >>>> text into sentences. java.text.BreakIterator yields better results 
>>>>>> >>>> and is also locale-sensitive. You might also want to take a look 
>>>>>> >>>> at the more advanced boundary analysis library described at 
>>>>>> >>>> http://userguide.icu-project.org/boundaryanalysis.
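>>>>>> >>>>
>>>>>> >>>> The ICU version is close to a drop-in replacement; a minimal
>>>>>> >>>> sketch, assuming the icu4j jar is on the classpath:
>>>>>> >>>>
>>>>>> >>>> import com.ibm.icu.text.BreakIterator;
>>>>>> >>>> import com.ibm.icu.util.ULocale;
>>>>>> >>>>
>>>>>> >>>> public class IcuSentenceDemo {
>>>>>> >>>>     public static void main(String[] args) {
>>>>>> >>>>         String text = "Vasco Rossi è un cantautore italiano. Seconda frase.";
>>>>>> >>>>         // Same API shape as java.text.BreakIterator, with better
>>>>>> >>>>         // language coverage for sentence boundaries.
>>>>>> >>>>         BreakIterator it = BreakIterator.getSentenceInstance(new ULocale("it"));
>>>>>> >>>>         it.setText(text);
>>>>>> >>>>         it.first();
>>>>>> >>>>         int end = it.next(); // boundary after the first sentence
>>>>>> >>>>         System.out.println(text.substring(0, end).trim());
>>>>>> >>>>     }
>>>>>> >>>> }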
>>>>>> >>>>
>>>>>> >>>> Regards,
>>>>>> >>>> Piotr
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Kontokostas Dimitris
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Pablo N. Mendes
>>>> http://pablomendes.com
>>>> Events: http://wole2012.eurecom.fr
>> 
>> 
>> 
>> -- 
>> Kontokostas Dimitris
> 
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
