Re: [Dbpedia-discussion] Italian short / long abstract problem

Piero Molino Tue, 27 Sep 2011 02:36:52 -0700

Hi Alessio,

You are giving examples of biographies. As I said a few emails ago, in the 
Italian Wikipedia the opening sentence of a biography is generated from a 
template rather than written as text. This is the example for Vasco Rossi:


{{Artista musicale
|nome = Vasco Rossi
|nazione = Italia
|genere = Hard rock
|nota genere = <ref>[http://www.ondarock.it/italia/vascorossi.htm Biografia] su 
Ondarock.it</ref><ref>[http://musica.accordo.it/articles/2011/03/50522/anteprima-vasco-rossi-vivere-o-niente-in-uscita-il-29.html
 Anteprima Vasco Rossi: Vivere o Niente, in uscita il 29] 
musica.accordo.it</ref><ref>[http://www.debaser.it/recensionidb/ID_10371/Vasco_Rossi_Vado_Al_Massimo.htm
 Vasco Rossi: Vado al massimo] 
Debaser.it</ref><ref>[http://www.ondarock.it/recensioni/2008_rossi.htm 
ondarock.it - Recensione "''Il Mondo che Vorrei''"]</ref>
|genere2 = Pop rock
|nota genere2 = <ref>[http://www.allmusic.com/artist/p210021 allmusic.com - 
Vasco Rossi]</ref><ref>[http://www.ondarock.it/recensioni/2008_rossi.htm VASCO 
ROSSI - Il Mondo Che Vorrei] Ondarock.it</ref>
|anno inizio attività = 1977
|anno fine attività = in attività
|note periodo attività = 
|etichetta = [[Lotus (casa discografica)|Lotus]], [[Durium|Targa]], [[Carosello 
(casa discografica)|Carosello]], [[EMI Italiana]]
|tipo artista = Cantautore
|immagine = Vasco Rossi 2.jpg
|didascalia = Vasco Rossi
|url = [http://www.vascorossi.net/ vascorossi.net]
|numero totale album pubblicati = 25
|numero album studio = 16
|numero album live = 7
|numero raccolte = 2
}}

{{Bio
|Nome = Vasco
|Cognome = Rossi
|PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo '''''Il 
Blasco'''''<ref>[http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556
 Ma Vasco Rossi torna a giugno. Il Blasco piace sempre] 
archivio.lastampa.it</ref>
|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|LuogoMorte =
|GiornoMeseMorte = 
|AnnoMorte = 
|Attività = cantautore
|Nazionalità = italiano
}}

Autodefinitosi 
''provoca(u)tore'',<ref>[http://www.ufficiostampa.rai.it/UFFICIO_STAMPA_MAIN_DETTAGLIO_NEWS.aspx?IDSCHEDAARCHIVIONEWS=32517
 "VASCO, IL PROVOCAUTORE"] ufficiostampa.rai.it</ref> nella sua carriera 
trentennale ha pubblicato 25 [[Album discografico|album]] (di cui 16 in studio, 
7 [[Album discografico#Album live|live]] e 2 [[Compilation|raccolte]] 
ufficiali) e composto complessivamente più di 150 canzoni, nonché numerosi 
testi e musiche per altri interpreti. Con più di trenta milioni di copie 
vendute<ref>[http://www.primissima.it/film/scheda/questa_storia_qua/ Questa 
storia qua] Primissima.it</ref><ref 
name=pressbook>[http://www.vascorossi.net/notizie/pressbook-venezia/ "Il rock 
dà l'idea che tutti ce la possono fare"] Vascorossi.net</ref> è uno dei 
cantautori italiani di maggior successo e 
fama.<ref>[http://www.lastoriasiamonoi.rai.it/puntata.aspx?id=821 Solo Vasco] 
lastoriasiamonoi.rai.it</ref><ref>[http://www.vascorossi.net/rassegna-stampa/questa-storia-qua-cin/
 Questa storia qua] Vascorossi.net</ref>

As you can see, the first sentence of the article text is the one that also 
opens the DBpedia abstract. That is normal behaviour for DBpedia; the problem 
is the Italian Wikipedia's text generation. To obtain the same text as the 
Wikipedia page, the DBpedia extractor would have to use the same generation 
algorithm (I don't know where to get it from).
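
To make the idea concrete, here is a minimal Python sketch of building the 
opening sentence from the Bio fields shown above. It is not the Italian 
Wikipedia's actual template logic: the sentence pattern, the helper names 
(extract_template, parse_params, strip_markup, bio_sentence) and the handling 
of article and verb tense are all simplifying assumptions, and the sample 
wikitext is a shortened version of the Vasco Rossi source.

```python
import re


def extract_template(wikitext, name):
    """Return the text between the braces of the first {{name|...}}, or None."""
    match = re.search(r"\{\{\s*" + re.escape(name) + r"\s*\|", wikitext)
    if match is None:
        return None
    start = match.start()
    depth, i = 0, start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def parse_params(body):
    """Split a template body on top-level '|' and return its key/value pairs."""
    fields, field, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            field.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            field.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            fields.append("".join(field))
            field = []
            i += 1
        else:
            field.append(body[i])
            i += 1
    fields.append("".join(field))
    params = {}
    for item in fields[1:]:  # fields[0] is the template name
        if "=" in item:
            key, value = item.split("=", 1)
            params[key.strip()] = value.strip()
    return params


def strip_markup(text):
    """Drop <ref> footnotes and bold/italic quote runs, normalise whitespace."""
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"'{2,}", "", text)
    return re.sub(r"\s+", " ", text).strip()


def bio_sentence(params):
    """Assumed pattern: Name[, extra] (place, date)[,] è un <activity> <nationality>."""
    def get(key):
        return strip_markup(params.get(key, ""))

    sentence = " ".join(x for x in (get("Nome"), get("Cognome")) if x)
    extra = get("PostCognomeVirgola")
    if extra:
        sentence += ", " + extra
    date = " ".join(x for x in (get("GiornoMeseNascita"), get("AnnoNascita")) if x)
    birth = ", ".join(x for x in (get("LuogoNascita"), date) if x)
    if birth:
        sentence += " (" + birth + ")"
        if extra:
            sentence += ","
    female = get("Sesso") == "F"
    if get("AnnoMorte"):  # simplification: a death year means past tense
        verb = "è stata" if female else "è stato"
    else:
        verb = "è"
    article = "una" if female else "un"
    role = " ".join(x for x in (get("Attività"), get("Nazionalità")) if x)
    return f"{sentence} {verb} {article} {role}."


SAMPLE = """{{Bio
|Nome = Vasco
|Cognome = Rossi
|PostCognomeVirgola = anche noto come '''Vasco'''<ref>[http://example.org/a?id=1 x]</ref>
|Sesso = M
|LuogoNascita = Zocca
|GiornoMeseNascita = 7 febbraio
|AnnoNascita = 1952
|AnnoMorte = 
|Attività = cantautore
|Nazionalità = italiano
}}
Autodefinitosi ''provoca(u)tore'', nella sua carriera trentennale ..."""

print(bio_sentence(parse_params(extract_template(SAMPLE, "Bio"))))
```

An extractor could prepend a sentence built this way to the text that follows 
the template, but the real template covers many more cases (several 
activities, death data, elided articles) than this sketch does.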

I worked on an algorithm that tries to replicate the generation, and it works 
quite well even if it's a bit messy. The real question is: is this behaviour, 
generating text from templates as for bios, common in other categories of 
Italian Wikipedia pages? You should show examples that are not "people with 
a biography" so we can find out.
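
One rough way to look for such examples is to list the templates that come 
before the first line of prose in a page's source and flag the pages where 
one of them is known to generate text. The Python sketch below assumes this 
heuristic; the function names and the generators tuple are hypothetical, and 
Bio is the only text-generating template confirmed in this thread.

```python
import re


def leading_templates(wikitext):
    """Names of the top-level templates that precede the first prose text."""
    names = []
    i, n = 0, len(wikitext)
    while i < n:
        if wikitext[i].isspace():
            i += 1
        elif wikitext.startswith("{{", i):
            depth, j = 0, i
            while j < n:
                if wikitext.startswith("{{", j):
                    depth += 1
                    j += 2
                elif wikitext.startswith("}}", j):
                    depth -= 1
                    j += 2
                    if depth == 0:
                        break
                else:
                    j += 1
            if depth != 0:
                break  # unbalanced braces: stop rather than guess
            body = wikitext[i + 2:j - 2]
            # The template name runs up to the first '|' or line break.
            names.append(re.split(r"[|\n]", body, maxsplit=1)[0].strip())
            i = j
        else:
            break  # first prose character reached
    return names


def first_sentence_may_be_generated(wikitext, generators=("Bio",)):
    """True if a known text-generating template precedes the prose."""
    return any(name in generators for name in leading_templates(wikitext))


page = "{{Artista musicale\n|nome = Vasco Rossi\n}}\n\n{{Bio\n|Nome = Vasco\n}}\n\nAutodefinitosi ..."
print(leading_templates(page))
print(first_sentence_may_be_generated(page))
```

Running this over a sample of non-biography pages would show which other 
leading templates are frequent; each of those would still have to be checked 
by hand to see whether it actually produces visible text.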

I was also thinking about asking Wikipedia Italy directly about this; maybe 
they can give a detailed description of the issue and perhaps some code for 
the generation.

Regards,
Piero Molino




On 27 Sep 2011, at 09:59, <[email protected]> wrote:

> Hello,
> 
>    Option 2 is the only viable one for us, because of the time frame. So if 
> you have a link or a document about that step, it would be a great way to 
> start.
> 
>    I can produce test cases. For example, the abstracts for 
> http://dbpedia.org/page/Vasco_Rossi, http://dbpedia.org/page/Gina_Lollobrigida 
> and http://dbpedia.org/page/Tom_Cruise start from the second sentence of the 
> corresponding Wikipedia articles.
> 
> Thanks,
> Alessio@ComplexityIntelligence
> 
> 
> 
> -------- Original Message --------
> Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
> From: Sebastian Hellmann <[email protected]>
> Date: Tue, September 27, 2011 12:26 am
> To: [email protected]
> Cc: Piero Molino <[email protected]>, 
> [email protected]
> 
> Hello all,
> I think the first important step is to produce a test case.
> Can you provide cases where the extraction fails?
> 
> The best would be to have, for each article, one file that contains the 
> original Wikipedia source and one file that contains the expected output.
> 
> This will be a good basis for regression test cases in the future.
> 
> Once you have provided these test cases, there are two choices:
> 1. You can wait until the next Wikipedia dump in 6 months; I guess that is 
> when we will try to fix the problem based on the test cases you provided.
> 2. Try to fix it yourself (with a little help from us) and make a new Italian 
> dump (or just the abstract data sets), which we will use to replace the 
> current one. (This would probably be much faster and you would learn how to 
> produce/tune the DBpedia data.)
> 
> All the best,
> Sebastian
> 
> 
> 
> 
> On 09/23/2011 09:45 AM, [email protected] wrote:
>> 
>> Hi Piero,
>> 
>>   This seems to be a very common problem for the Italian dataset(s). I've 
>> done some random tests, and more than 50% of the abstracts are broken (the 
>> sample was very informal, so the number is only indicative).
>> 
>>    Just for example:
>> 
>>     http://dbpedia.org/page/Vasco_Rossi
>>     http://dbpedia.org/page/Gina_Lollobrigida
>>     http://dbpedia.org/page/Tom_Cruise
>> 
>>    and many others. Another problem: 'Ornella Muti' is missing.
>> 
>>    As you can see, the problem is only for Italian, other language abstracts 
>> are ok.
>> 
>> Please let me know. I can help to fix that.
>> 
>> Alessio
>>    
>> -------- Original Message --------
>> Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
>> From: Piero Molino <[email protected]>
>> Date: Thu, September 22, 2011 10:12 am
>> To: Sebastian Hellmann <[email protected]>
>> Cc: [email protected],
>> [email protected]
>> 
>> Hello Alessio,
>> 
>> The fact that the abstract starts from the second sentence is probably due 
>> to the fact that many first sentences are generated from templates in the 
>> Italian Wikipedia.
>> A clear example is the Bio template for people.
>> So please take a look at the original Wikipedia page source and see if this 
>> is the problem. If not, please give us some examples of the broken abstracts.
>> 
>> Regards,
>> Piero Molino
>> 
>> 
>> 
>> On 22 Sep 2011, at 19:06, Sebastian Hellmann wrote:
>> 
>>> Dear Alessio,
>>> Sorry, but this is actually not on the top of our Todo list.
>>> We could assist you  a little in fixing the problem.
>>> Would you be willing to try it?
>>> Sebastian
>>> 
>>> On 09/22/2011 02:02 PM, [email protected] wrote:
>>>> 
>>>> Hello,
>>>> 
>>>>    In DBpedia 3.7, a large number of the long and short abstracts for the 
>>>> Italian language are broken.
>>>> They start from the second sentence of the Wikipedia article, skipping the 
>>>> first one, so the abstract as a whole is of little use because the subject 
>>>> is often unclear.
>>>> 
>>>>    Is it possible to fix this problem?
>>>> 
>>>> Alessio
>>>>  
>>>> ------------------------------------------------------------------------------
>>>> All the data continuously generated in your IT infrastructure contains a
>>>> definitive record of customers, application performance, security
>>>> threats, fraudulent activity and more. Splunk takes this data and makes
>>>> sense of it. Business sense. IT sense. Common sense.
>>>> http://p.sf.net/sfu/splunk-d2dcopy1
>>>> 
>>>> _______________________________________________
>>>> Dbpedia-discussion mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>> 
>> 
>

