Alessio,
I will go ahead and guess that a good starting point is the
AbstractExtractor [1]. I believe you have to install MediaWiki on your
machine and load the Wikipedia dump into it. The extractor connects to
MediaWiki and asks for a rendered page (with templates resolved), then
extracts the abstract. The class seems to use HTTP requests for this, which
sounds like a waste of resources to me, so I'd also suggest changing the
code to call PHP directly [2].
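To make the round trip concrete, here is a rough sketch of the kind of request involved. This is illustrative Python, not the actual Scala code; the local endpoint and the use of api.php with action=parse are assumptions based on the standard MediaWiki API, not something I checked against the extractor itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of a local MediaWiki install loaded with the dump.
API_URL = "http://localhost/mediawiki/api.php"

def build_parse_request(title):
    """Build the api.php URL that asks MediaWiki to render a page with
    all templates resolved (the standard action=parse call)."""
    params = {
        "action": "parse",  # render the page, expanding templates
        "page": title,
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)

# The extractor would fetch this URL over HTTP and then cut the abstract
# (roughly, the first rendered paragraph) out of the returned HTML.
print(build_parse_request("Vasco Rossi"))
```

Spawning PHP directly, as in [2], would replace the HTTP fetch with a subprocess call but leave the rest of the flow unchanged.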
That said, I should disclose that all my knowledge about this is based on
overhearing conversations while brewing coffee. :)
I will also dare to offer another idea. The people behind Sweble (
http://sweble.org/) claim it is very thorough, and there seems to be a lot
of activity behind it. Maybe it is worth trying an alternative abstract
extractor based on their library that goes straight to the dump, renders
the pages and grabs the abstracts for you. If this worked, it would spare
you from installing MySQL, MediaWiki, etc. They have a nice demo of the
parser here:
http://sweble.org/2011/05/using-crystallball-the-sweble-parser-demo/
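Getting at the page sources in the dump is the easy part; the point of Sweble would be the rendering. As a minimal illustration (plain Python with only the standard library, and assuming for brevity a namespace-free version of the MediaWiki XML export layout — real dumps put a namespace on every element), iterating a dump looks roughly like this:

```python
import xml.etree.ElementTree as ET
from io import StringIO

def iter_pages(xml_file):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump.
    Assumption: the export namespace has been stripped, to keep
    the example short."""
    for _event, elem in ET.iterparse(xml_file):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            yield title, text
            elem.clear()  # keep memory flat on multi-gigabyte dumps

# A tiny stand-in for a dump, just to show the shape of the data.
sample = StringIO(
    "<mediawiki><page><title>Vasco Rossi</title>"
    "<revision><text>{{Bio|Nome=Vasco|Cognome=Rossi}}</text></revision>"
    "</page></mediawiki>"
)
for title, text in iter_pages(sample):
    print(title, "->", text)
```

A Sweble-based extractor would then hand each page's wikitext to the parser, render it, and keep the first paragraph as the abstract.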
Cheers,
Pablo
[1]
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/2774b30ef50a/core/src/main/scala/org/dbpedia/extraction/mappings/AbstractExtractor.scala
[2] http://stackoverflow.com/questions/614995/calling-php-from-java
On Tue, Sep 27, 2011 at 11:34 AM, Piero Molino <[email protected]> wrote:
> Hi Alessio,
>
> you are giving examples about biographies. As I said a few emails ago, in
> the Italian Wikipedia biographies are generated from the template and not
> written as text. Here is the example for Vasco Rossi:
>
> {{Artista musicale
> |nome = Vasco Rossi
> |nazione = Italia
> |genere = Hard rock
> |nota genere = <ref>[http://www.ondarock.it/italia/vascorossi.htm Biografia] su
> Ondarock.it</ref><ref>[
> http://musica.accordo.it/articles/2011/03/50522/anteprima-vasco-rossi-vivere-o-niente-in-uscita-il-29.html Anteprima
> Vasco Rossi: Vivere o Niente, in uscita il 29]
> musica.accordo.it</ref><ref>[
> http://www.debaser.it/recensionidb/ID_10371/Vasco_Rossi_Vado_Al_Massimo.htm Vasco
> Rossi: Vado al massimo]
> Debaser.it</ref><ref>[http://www.ondarock.it/recensioni/2008_rossi.htm
> ondarock.it - Recensione "''Il Mondo che Vorrei''"]</ref>
> |genere2 = Pop rock
> |nota genere2 = <ref>[http://www.allmusic.com/artist/p210021 allmusic.com
> - Vasco Rossi]</ref><ref>[http://www.ondarock.it/recensioni/2008_rossi.htm
> VASCO ROSSI - Il Mondo Che Vorrei] Ondarock.it</ref>
> |anno inizio attività = 1977
> |anno fine attività = in attività
> |note periodo attività =
> |etichetta = [[Lotus (casa discografica)|Lotus]], [[Durium|Targa]],
> [[Carosello (casa discografica)|Carosello]], [[EMI Italiana]]
> |tipo artista = Cantautore
> |immagine = Vasco Rossi 2.jpg
> |didascalia = Vasco Rossi
> |url = [http://www.vascorossi.net/ vascorossi.net]
> |numero totale album pubblicati = 25
> |numero album studio = 16
> |numero album live = 7
> |numero raccolte = 2
> }}
>
> {{Bio
> |Nome = Vasco
> |Cognome = Rossi
> |PostCognomeVirgola = anche noto come '''Vasco''' o con l'appellativo
> '''''Il Blasco'''''<ref>[
> http://archivio.lastampa.it/LaStampaArchivio/main/History/tmpl_viewObj.jsp?objid=1092556 Ma
> Vasco Rossi torna a giugno. Il Blasco piace sempre]
> archivio.lastampa.it</ref>
> |Sesso = M
> |LuogoNascita = Zocca
> |GiornoMeseNascita = 7 febbraio
> |AnnoNascita = 1952
> |LuogoMorte =
> |GiornoMeseMorte =
> |AnnoMorte =
> |Attività = cantautore
> |Nazionalità = italiano
> }}
>
> Autodefinitosi ''provoca(u)tore'',<ref>[
> http://www.ufficiostampa.rai.it/UFFICIO_STAMPA_MAIN_DETTAGLIO_NEWS.aspx?IDSCHEDAARCHIVIONEWS=32517 "VASCO,
> IL PROVOCAUTORE"]
> ufficiostampa.rai.it</ref> nella sua carriera trentennale ha pubblicato 25
> [[Album discografico|album]] (di cui 16 in studio, 7 [[Album
> discografico#Album live|live]] e 2 [[Compilation|raccolte]] ufficiali) e
> composto complessivamente più di 150 canzoni, nonché numerosi testi e
> musiche per altri interpreti. Con più di trenta milioni di copie
> vendute<ref>[http://www.primissima.it/film/scheda/questa_storia_qua/
> Questa storia qua] Primissima.it</ref><ref name=pressbook>[
> http://www.vascorossi.net/notizie/pressbook-venezia/ "Il rock dà l'idea
> che tutti ce la possono fare"] Vascorossi.net</ref> è uno dei cantautori
> italiani di maggior successo e
> fama.<ref>[http://www.lastoriasiamonoi.rai.it/puntata.aspx?id=821
> Solo Vasco]
> lastoriasiamonoi.rai.it</ref><ref>[http://www.vascorossi.net/rassegna-stampa/questa-storia-qua-cin/
> Questa storia qua] Vascorossi.net</ref>
>
> As you can see, the first sentence is the one that also appears in the
> DBpedia abstract. This is normal behaviour for DBpedia; the problem is the
> Italian Wikipedia's text generation. So to obtain the same text as the
> Wikipedia page, the DBpedia extractor would have to use the same generation
> algorithm (I don't know where to get it from).
>
> I worked on an algorithm that tries to replicate the generation, and it
> works quite well even if it's a bit messy. But the real question is: is
> this behaviour of generating text from templates common in other
> categories of Italian Wikipedia pages? You should show examples that are
> not "people with a biography" so we can find out.
>
> I was also thinking about asking Wikipedia Italy directly about this;
> maybe they can give a detailed description of the issue and perhaps some
> code for the generation.
>
> Regards,
> Piero Molino
>
>
>
>
> On 27 Sep 2011, at 09:59, <
> [email protected]> wrote:
>
> Hello,
>
> Option number 2 is the only way for us (because of time). So if you have a
> link or a document about that step, it would be a great way to start.
>
> I can produce test cases. Just as an example, the abstracts for
> http://dbpedia.org/page/Vasco_Rossi,
> http://dbpedia.org/page/Gina_Lollobrigida and
> http://dbpedia.org/page/Tom_Cruise start from the second sentence of the
> related Wikipedia articles.
>
> Thanks,
> Alessio@ComplexityIntelligence
>
>
>
> -------- Original Message --------
>
> Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
> From: Sebastian Hellmann <[email protected]>
> Date: Tue, September 27, 2011 12:26 am
> To: [email protected]
> Cc: Piero Molino <[email protected]>,
> [email protected]
>
> Hello all,
> I think the first important step is to produce a test case.
> Can you provide cases where the extraction fails?
>
> The best would be to have, for each article, one file that contains the
> original Wikipedia source and one file that contains the expected output.
>
> This will be a good basis for regression test cases in the future.
>
> Once you have provided these test cases, there are two options:
> 1. You can wait until the next Wikipedia dump in about 6 months; I guess
> that is when we will try to fix the problem based on the test cases you
> provided.
> 2. Try to fix it yourself (with a little help from us) and make a new
> Italian dump (or the abstract data sets), which we will use to replace the
> current one. (This would probably be much faster, and you would learn how
> to produce/tune the DBpedia data.)
>
> All the best,
> Sebastian
>
>
>
>
> On 09/23/2011 09:45 AM, [email protected] wrote:
>
> Hi Piero,
>
> This seems to be a very common problem for the Italian dataset(s). I've
> done some random tests, and more than 50% of the abstracts are messed up
> (this was a very informal random test, so the numbers are only indicative).
>
> Just for example:
>
> http://dbpedia.org/page/Vasco_Rossi
> http://dbpedia.org/page/Gina_Lollobrigida
> http://dbpedia.org/page/Tom_Cruise
>
> and many others. Another problem: 'Ornella Muti' is missing.
>
> As you can see, the problem affects only Italian; abstracts in other
> languages are OK.
>
> Please let me know. I can help to fix that.
>
> Alessio
>
>
> -------- Original Message --------
> Subject: Re: [Dbpedia-discussion] Italian short / long abstract problem
> From: Piero Molino <[email protected]>
> Date: Thu, September 22, 2011 10:12 am
> To: Sebastian Hellmann <[email protected]>
> Cc: [email protected],
> [email protected]
>
> Hello Alessio,
>
> the fact that the abstract starts from the second sentence is probably due
> to the fact that many first sentences are generated from templates in the
> Italian Wikipedia. A clear example is the Bio template for people.
> So please take a look at the original Wikipedia page source and see if
> this is the problem. If not, please give us some examples of the messed-up
> abstracts.
>
> Regards,
> Piero Molino
>
>
>
> On 22 Sep 2011, at 19:06, Sebastian Hellmann wrote:
>
> Dear Alessio,
> Sorry, but this is actually not on the top of our Todo list.
> We could assist you a little in fixing the problem.
> Would you be willing to try it?
> Sebastian
>
> On 09/22/2011 02:02 PM, [email protected] wrote:
>
> Hello,
>
> In DBpedia 3.7, a big amount of the long and short abstracts for the
> Italian language are messed up. They start from the second sentence of the
> Wikipedia article, skipping the first one, so the abstract as a whole is
> of little use, as the subject is often unclear.
>
> Is it possible to fix that problem?
>
> Alessio
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2dcopy1
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>
>
>
>
>
>
>
>
>