Hi Dimitris,
Actually, in my previous mail I wanted to know whether there is any API with
which I can get a dump of the updated page-article pages, so that from those
I can generate live dump files (nt, ttl) for languages other than English.
On Thu, Mar 7, 2013 at 1:57 PM, Dimitris Kontokostas <[email protected]> wrote:
> Hi all,
>
> This is exactly what DBpedia Live is doing. We have "Feeders" that place
> articles in a process queue, extract triples out of them and then update a
> triple store.
> For now we mainly use OAI-PMH for our feeds, but we could easily add a new
> IRC-Feeder for Amit's needs.
>
> It depends on what you want: if you want dump files (nt, ttl), then feeding
> custom XML dumps to the dump module will be simpler; if, on the other hand,
> you want an up-to-date triple store, then the live module will suit your
> needs better.
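>
> To give a rough idea of the shape of that pipeline, here is a minimal
> hypothetical Python sketch (not the actual DBpedia Live code): feeders push
> changed page titles onto a queue, and a worker pulls them off, extracts
> triples and updates a store.
>
> import queue
> import threading
>
> work_queue = queue.Queue()
> triple_store = []  # stand-in for a real triple store
>
> def extract_triples(title):
>     # placeholder for the real per-page extraction step
>     resource = "dbpedia:" + title.replace(" ", "_")
>     return [(resource, "dbo:wikiPageModified", "now")]
>
> def worker():
>     while True:
>         title = work_queue.get()
>         triple_store.extend(extract_triples(title))
>         work_queue.task_done()
>
> threading.Thread(target=worker, daemon=True).start()
>
> # a feeder (OAI-PMH, IRC, ...) would push changed titles like this:
> for title in ["Assistive technology", "Amoeboid"]:
>     work_queue.put(title)
> work_queue.join()
> print(len(triple_store), "triples in the store")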
>
> Best,
> Dimitris
>
>
> On Thu, Mar 7, 2013 at 9:34 AM, gaurav pant <[email protected]> wrote:
>
>> Hi All/Amit,
>>
>> @Amit - Thanks for shedding light on new things; so there is also an API
>> to get updated pages in bulk.
>>
>> My requirement is to get all the pages that have been updated within a
>> certain interval. It is also OK if I can get the updated page dump on a
>> daily basis. Then I will process these page dumps using the
>> dbpedia-extractor to extract the infobox and abstract, so that I have
>> current infobox and abstract data.
>>
>>
>> As you have mentioned, "Wikipedia's mediawiki software gives you an api to
>> download these pages in bulk". Then I can call that API to get a dump of
>> the latest updated pages.
>>
>> I searched for the same, but it seems it only supports English and German.
>> However, I want the API for all the languages (about 6-7 languages).
>>
>> Please help with this and let me know if there is any such API with which
>> I can get an updated page dump for the different languages.
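>>
>> For example, something like the following is what I hope works for any
>> language edition (an untested sketch; it assumes only that each
>> <lang>.wikipedia.org exposes the same api.php with the standard
>> recentchanges and export parameters):
>>
>> import requests
>>
>> LANG = "fr"  # e.g. "de", "it", "nl", ...
>> API = "https://" + LANG + ".wikipedia.org/w/api.php"
>>
>> # 1. list pages changed recently (main namespace only)
>> changes = requests.get(API, params={
>>     "action": "query",
>>     "list": "recentchanges",
>>     "rcnamespace": 0,
>>     "rcprop": "title|timestamp",
>>     "rclimit": 50,
>>     "format": "json",
>> }).json()
>> titles = {rc["title"] for rc in changes["query"]["recentchanges"]}
>>
>> # 2. export those pages as page-article XML, same format as the dumps
>> xml_dump = requests.get(API, params={
>>     "action": "query",
>>     "export": 1,
>>     "exportnowrap": 1,
>>     "prop": "revisions",
>>     "rvprop": "timestamp|content",
>>     "titles": "|".join(titles),
>> }).text
>> print(xml_dump[:500])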
>>
>> Otherwise I am going to use my approach.
>>
>> Suppose I have an XML dump like this:
>> "<page>
>> <title>AssistiveTechnology</title>
>> <ns>0</ns>
>> <id>23</id>
>> <redirect title="Assistive technology" />
>> <revision>
>> <id>74466798</id>
>> <parentid>15898957</parentid>
>> *<timestamp>2006-09-08T04:17:00Z</timestamp>*
>> <contributor>
>> <username>Rory096</username>
>> <id>750223</id>
>> </contributor>
>> <comment>cat rd</comment>
>> <text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R
>> from CamelCase}}</text>
>> <sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
>> <model>wikitext</model>
>> <format>text/x-wiki</format>
>> </revision>
>> </page>
>> <page>
>> <title>AmoeboidTaxa</title>
>> <ns>0</ns>
>> <id>24</id>
>> <redirect title="Amoeboid" />
>> <revision>
>> <id>74466889</id>
>> <parentid>15898958</parentid>
>> * <timestamp>2013-09-08T04:17:51Z</timestamp>*
>> <contributor>
>> <username>Rory096</username>
>> <id>750223</id>
>> </contributor>
>> <comment>cat rd</comment>
>> <text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from
>> CamelCase}}</text>
>> <sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
>> <model>wikitext</model>
>> <format>text/x-wiki</format>
>> </revision>
>> </page>
>> "
>> I will process this and remove everything between <page>..</page> for the
>> page whose title is "AssistiveTechnology", because its timestamp does not
>> have "2013" as the year. Then I will give the entire remaining page-article
>> XML to the extractor to parse.
>>
>>
>>
>> On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]> wrote:
>>
>>> Hi Gaurav,
>>> I don't know your exact use case, but here's what we do. There is an IRC
>>> channel where Wikipedia continuously lists pages as and when they change.
>>> We listen to that IRC channel and every hour make a list of the unique
>>> pages that changed. Wikipedia's MediaWiki software gives you an API to
>>> download these pages in bulk; it looks like this "
>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=
>>> "
>>>
>>> You can download these pages and put them in the same format as the full
>>> dump download by appending the Wikipedia namespace list (you can get the
>>> list from "
>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content ").
>>> Thereafter you can put the file in the same location as the full dump and
>>> invoke the extraction code. It works as expected.
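>>>
>>> The hourly download step could look roughly like this (a hypothetical
>>> helper, assuming the titles were already collected from the IRC feed as
>>> above; note that each response is a complete <mediawiki> export document,
>>> so the pieces still have to be merged into one dump before extraction):
>>>
>>> import requests
>>>
>>> API = "http://en.wikipedia.org/w/api.php"
>>>
>>> def export_pages(titles, out_path, batch_size=50):
>>>     """Fetch the changed pages in batches and write one dump-like file."""
>>>     with open(out_path, "w", encoding="utf-8") as out:
>>>         for i in range(0, len(titles), batch_size):
>>>             batch = titles[i:i + batch_size]
>>>             xml = requests.get(API, params={
>>>                 "action": "query",
>>>                 "export": 1,
>>>                 "exportnowrap": 1,
>>>                 "prop": "revisions",
>>>                 "rvprop": "timestamp|content",
>>>                 "titles": "|".join(batch),
>>>             }).text
>>>             out.write(xml)
>>>
>>> # export_pages(sorted(changed), "hourly-pages-articles.xml")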
>>>
>>>
>>> Regards
>>> Amit
>>>
>>>
>>>
>>> From: gaurav pant <[email protected]>
>>> Date: Thursday, March 7, 2013 12:17 PM
>>> To: Dimitris Kontokostas <[email protected]>,
>>>     "[email protected]" <[email protected]>
>>>
>>> Subject: Re: [Dbpedia-discussion] page article has last modified
>>> timestamp
>>>
>>> Hi All,
>>>
>>> Thanks, Dimitris, for your help.
>>>
>>> I also want one more confirmation from you.
>>>
>>> I just went through the code of the InfoboxExtractor. There it seems to
>>> me that the code is written to process the data page by page
>>> (<page>..</page>). If I remove all those pages from the "page-article"
>>> dump using some perl/python script and then apply the infobox extraction
>>> or abstract extraction, then we will get only the updated triples as
>>> output, like DBpedia Live for English.
>>>
>>> Please correct me if I am wrong.
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Regards
>> Gaurav Pant
>> +91-7709196607,+91-9405757794
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>
--
Regards
Gaurav Pant
+91-7709196607,+91-9405757794
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion