Hi Gaurav,
Where did you read that the bulk download is only supported for English and
German? I tried the Italian API endpoint and it works fine.
http://it.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=India
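For the other languages you need, you would just swap the subdomain. A quick Python sketch (the language codes and titles below are only placeholders):

import urllib.parse

def export_url(lang, titles):
    # Same export query as the URL above, with the language subdomain as a parameter.
    params = urllib.parse.urlencode({
        "action": "query",
        "export": "",
        "exportnowrap": "",
        "prop": "revisions",
        "rvprop": "timestamp|content",
        "titles": "|".join(titles),
    })
    return "http://%s.wikipedia.org/w/api.php?%s" % (lang, params)

for lang in ("it", "fr", "es"):              # placeholder language codes
    print(export_url(lang, ["India"]))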
Regards
Amit
From: gaurav pant <[email protected]>
Date: Thursday, March 7, 2013 1:04 PM
To: Amit Kumar <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Dbpedia-discussion] page article has last modified timestamp
Hi All/Amit,
@Amit - Thanks for shedding light on this; so there is also an API to get
updated pages in bulk.
My requirement is to get all pages that were updated within a certain
interval. It would also be fine to get a dump of the updated pages on a daily
basis. I would then process these page dumps with the DBpedia extractor to
extract infoboxes and abstracts, so that I always have current infobox and
abstract data.
As you mentioned, "Wikipedia's mediawiki software gives you an api to
download these pages in bulk". So I can use this API to get a dump of the
latest updated pages.
I searched for this, but it seems it only supports English and German, whereas
I need the API for all my languages (about 6-7 languages).
Please let me know if there is any such API with which I can get updated page
dumps for the different languages.
Otherwise I am going to use my own approach.
Suppose I have this XML dump:
"<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>
<redirect title="Assistive technology" />
<revision>
<id>74466798</id>
<parentid>15898957</parentid>
<timestamp>2006-09-08T04:17:00Z</timestamp>
<contributor>
<username>Rory096</username>
<id>750223</id>
</contributor>
<comment>cat rd</comment>
<text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R from CamelCase}}</text>
<sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
<page>
<title>AmoeboidTaxa</title>
<ns>0</ns>
<id>24</id>
<redirect title="Amoeboid" />
<revision>
<id>74466889</id>
<parentid>15898958</parentid>
<timestamp>2013-09-08T04:17:51Z</timestamp>
<contributor>
<username>Rory096</username>
<id>750223</id>
</contributor>
<comment>cat rd</comment>
<text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from CamelCase}}</text>
<sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
"
I will process this and remove everything between <page>..</page> for the page
titled "AssistiveTechnology", because its timestamp does not have "2013" as the
year. Then I will feed the entire page-article XML to the parser.
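Something like this rough Python sketch is what I have in mind (the file names and cutoff year are placeholders, and it assumes the usual <mediawiki> root around the pages in the full dump):

import xml.etree.ElementTree as ET

def filter_pages(in_path, out_path, year="2013"):
    # Stream the full dump and keep only <page> elements whose revision
    # timestamp starts with the given year; everything else is dropped.
    # The real dump header (siteinfo, namespaces) is not reproduced here.
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<mediawiki>\n")
        for _, elem in ET.iterparse(in_path, events=("end",)):
            if elem.tag.rsplit("}", 1)[-1] == "page":    # ignore an XML namespace if present
                ts = ""
                for child in elem.iter():
                    if child.tag.rsplit("}", 1)[-1] == "timestamp":
                        ts = child.text or ""
                        break
                if ts.startswith(year):
                    out.write(ET.tostring(elem, encoding="unicode"))
                elem.clear()                             # free memory on large dumps
        out.write("</mediawiki>\n")

filter_pages("pages-articles.xml", "pages-articles-updated.xml")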
On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]> wrote:
Hi Gaurav,
I don't know your exact use case, but here is what we do. There is an IRC
channel where Wikipedia continuously lists pages as and when they change. We
listen to that IRC channel and every hour build a list of the unique pages that
changed.
Wikipedia's MediaWiki software gives you an API to download these pages in
bulk; it looks like this:
"http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles="
You can download these pages and put them in the same format as the full dump
download by appending the Wikipedia namespace list (you can get the list from
"http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content").
Thereafter you can put the file in the same location as the full dump and
invoke the extraction code. It works as expected.
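Roughly, the fetch step looks something like the sketch below (not our actual code; the titles and file name are placeholders, and merging several batches plus prepending the siteinfo/namespace header are left out):

import urllib.parse
import urllib.request

def fetch_export(lang, titles):
    # Pull the current revision of the given titles through the export API.
    params = urllib.parse.urlencode({
        "action": "query",
        "export": "",
        "exportnowrap": "",
        "prop": "revisions",
        "rvprop": "timestamp|content",
        "titles": "|".join(titles),          # the API caps how many titles fit in one request
    })
    url = "http://%s.wikipedia.org/w/api.php?%s" % (lang, params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")   # a <mediawiki> export document

# Placeholder titles; in our setup they come from the hourly IRC list.
with open("changed-pages.xml", "w", encoding="utf-8") as out:
    out.write(fetch_export("en", ["India", "Assistive technology"]))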
Regards
Amit
From: gaurav pant <[email protected]>
Date: Thursday, March 7, 2013 12:17 PM
To: Dimitris Kontokostas <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Dbpedia-discussion] page article has last modified timestamp
Hi All,
Thanks, Dimitris, for your help.
I would also like one more confirmation from you.
I just went through the code of InfoboxExtractor, and it seems to me that the
code processes the data page by page (<page>..</page>). So if I remove all the
pages that were not updated from the "page-article" dump using a Perl/Python
script and then apply infobox extraction or abstract extraction, we will get
only the updated triples as output, like DBpedia Live for English.
Please correct me if I am wrong.
Thanks
--
Regards
Gaurav Pant
+91-7709196607,+91-9405757794