Hi Gaurav,
Where did you read that the bulk download is only supported for English and 
German? I tried the Italian API endpoint and it works fine:
http://it.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=India
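
For reference, the same call can be made from a small script. This is only a 
minimal sketch; it assumes Python with the requests package installed, neither 
of which is mentioned in the thread:

import requests

# Fetch the export XML for one title from the Italian endpoint above.
params = {
    "action": "query",
    "export": "",          # flag parameter: include an XML export
    "exportnowrap": "",    # flag parameter: return the raw export XML
    "prop": "revisions",
    "rvprop": "timestamp|content",
    "titles": "India",     # several titles can be joined with "|"
}
resp = requests.get("http://it.wikipedia.org/w/api.php", params=params)
print(resp.text[:500])     # the response body is MediaWiki export XML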



Regards
Amit


From: gaurav pant <[email protected]>
Date: Thursday, March 7, 2013 1:04 PM
To: Amit Kumar <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: [Dbpedia-discussion] page article has last modified timestamp

Hi All/Amit,

@Amit - Thanks for shedding light on this; so there is an API to fetch updated 
pages in bulk.

My requirement is to get all pages that were updated within a certain interval. 
It would also be fine to get a dump of the updated pages on a daily basis. Then 
I will process these page dumps with the DBpedia extractor to extract infoboxes 
and abstracts, so that I always have current infobox and abstract data.


You mentioned that "Wikipedia's mediawiki software gives you an api to download 
these pages in bulk". So I could use that API to get a dump of the most 
recently updated pages.

I searched for this, but it seemed to be supported only for English and German. 
I need the API for all the languages I work with (about 6-7 languages).

Please let me know if there is such an API that can give me the updated page 
dump for different languages.

Otherwise I will use my own approach.

Suppose I have an XML dump like this:
"<page>
    <title>AssistiveTechnology</title>
    <ns>0</ns>
    <id>23</id>
    <redirect title="Assistive technology" />
    <revision>
      <id>74466798</id>
      <parentid>15898957</parentid>
      <timestamp>2006-09-08T04:17:00Z</timestamp>
      <contributor>
        <username>Rory096</username>
        <id>750223</id>
      </contributor>
      <comment>cat rd</comment>
      <text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R from 
CamelCase}}</text>
      <sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
  <page>
    <title>AmoeboidTaxa</title>
    <ns>0</ns>
    <id>24</id>
    <redirect title="Amoeboid" />
    <revision>
      <id>74466889</id>
      <parentid>15898958</parentid>
      <timestamp>2013-09-08T04:17:51Z</timestamp>
      <contributor>
        <username>Rory096</username>
        <id>750223</id>
      </contributor>
      <comment>cat rd</comment>
      <text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from 
CamelCase}}</text>
      <sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
"
I will process this and remove everything between <page>..</page> for the page 
titled "AssistiveTechnology", because its timestamp does not have 2013 as the 
year. Then I will pass the entire page-article XML to the parser.
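
For what it's worth, a rough sketch of that filtering step in Python could look 
like the following. The file names are placeholders, and I assume the dump is 
well-formed XML with a single root element (which the snippet above, on its 
own, is not):

import xml.etree.ElementTree as ET

def localname(tag):
    # Dump files may carry an XML namespace; compare on local names only.
    return tag.rsplit("}", 1)[-1]

# Stream through the dump and keep only pages whose revision timestamp
# starts with "2013"; every other page is dropped.
with open("pages-filtered.xml", "wb") as out:
    for event, elem in ET.iterparse("pages-articles.xml", events=("end",)):
        if localname(elem.tag) != "page":
            continue
        timestamps = [e.text for e in elem.iter()
                      if localname(e.tag) == "timestamp"]
        if any(ts and ts.startswith("2013") for ts in timestamps):
            out.write(ET.tostring(elem))
        elem.clear()  # free memory while streaming a large dump

# The surviving <page> elements still need the original <mediawiki>
# header and footer wrapped around them before they go to the extractor.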


On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]> wrote:
Hi Gaurav,
I don't know your exact use case, but here's what we do. There is an IRC 
channel where Wikipedia continuously lists pages as and when they change. We 
listen to that IRC channel and every hour build a list of the unique pages that 
changed. Wikipedia's MediaWiki software gives you an API to download these 
pages in bulk; it looks like this:
"http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles="

You can download these pages and put them in the same format as the full dump 
download by appending the Wikipedia namespace list (you can get the list from 
"http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content"). 
After that you can put the file in the same location as the full dump and 
invoke the extraction code. It works as expected.
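
As an illustration only (not Amit's actual code), the hourly bulk-download step 
could look roughly like this in Python; the chunk size of 50 titles and the 
requests dependency are my own assumptions:

import requests

def fetch_export_xml(titles, lang="en", chunk=50):
    # Download the changed pages in bulk as MediaWiki export XML,
    # one chunk of titles per request.
    api = "http://%s.wikipedia.org/w/api.php" % lang
    for i in range(0, len(titles), chunk):
        params = {
            "action": "query",
            "export": "",
            "exportnowrap": "",
            "prop": "revisions",
            "rvprop": "timestamp|content",
            "titles": "|".join(titles[i:i + chunk]),
        }
        yield requests.get(api, params=params).text

# Placeholder for the hourly list of unique changed pages from IRC.
changed = ["India", "Assistive technology"]
for xml_chunk in fetch_export_xml(changed):
    print(xml_chunk[:200])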


Regards
Amit



From: gaurav pant <[email protected]>
Date: Thursday, March 7, 2013 12:17 PM
To: Dimitris Kontokostas <[email protected]>, "[email protected]" <[email protected]>

Subject: Re: [Dbpedia-discussion] page article has last modified timestamp

Hi All,

Thanks, Dimitris, for your help.

I also want one more confirmation from you.

I just went through the code of InfoboxExtractor. It seems to me that the code 
is written to process the data page by page (<page>..</page>). If I remove all 
those pages from the "page-article" dump using some Perl/Python script and then 
apply infobox extraction or abstract extraction, we will get only the updated 
triples as output, like DBpedia Live for English.

Please correct me if I am wrong.

Thanks



--
Regards
Gaurav Pant
+91-7709196607,+91-9405757794
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
