Hi all,

This is exactly what DBpedia Live does. We have "Feeders" that place
articles in a processing queue, extract triples from them and then update a
triple store.
For now we mainly use OAI-PMH for our feeds, but we could easily add a new
IRC feeder for Amit's needs.

It depends on what you want. If you want dump files (NT, TTL), then feeding
custom XML dumps to the dump module will be simpler; if, on the other hand,
you want an up-to-date triple store, then the live module will suit your
needs better.

Best,
Dimitris


On Thu, Mar 7, 2013 at 9:34 AM, gaurav pant <[email protected]> wrote:

> Hi All/Amit,
>
> @Amit - Thanks for shedding light on new things; there is also an API to
> get updated pages in bulk.
>
> My requirement is to get all pages that have been updated within a
> certain interval. It is also OK if I can get the updated page dump on a
> daily basis. Then I will process these page dumps using the DBpedia
> extractor to extract the infobox and abstract, so that I have current
> infobox and abstract data.
>
>
> As you mentioned, "Wikipedia's mediawiki software gives you an api to
> download these pages in bulk". Then I can use that API to get a dump of
> the latest updated pages.
>
> I searched for the same, but it seems it only supports English and
> German. However, I want an API for all of my languages (about 6-7
> languages).
>
> Please help me with this and let me know if there is any such API through
> which I can get an updated page dump for different languages.
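>
> (What I would hope such an API to look like, assuming the same api.php
> endpoint exists for every language edition, is the URL Amit posted with
> the language code substituted into the hostname; a rough Python sketch of
> what I mean, where the language list is only an example:)
>
> from urllib.parse import quote
>
> def export_url(lang, titles):
>     # Same parameters as the URL in Amit's mail, with the language code
>     # substituted into the hostname.
>     return ("http://%s.wikipedia.org/w/api.php" % lang
>             + "?action=query&export&exportnowrap"
>             "&prop=revisions&rvprop=timestamp|content"
>             "&titles=" + quote("|".join(titles)))
>
> # Example languages only; any language edition should work the same way.
> for lang in ["en", "de", "fr", "it", "es", "nl"]:
>     print(export_url(lang, ["Assistive technology", "Amoeboid"]))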
>
> Otherwise I am going to use my own approach:
>
> Suppose I have this XML dump:
> "<page>
>     <title>AssistiveTechnology</title>
>     <ns>0</ns>
>     <id>23</id>
>     <redirect title="Assistive technology" />
>     <revision>
>       <id>74466798</id>
>       <parentid>15898957</parentid>
>       <timestamp>2006-09-08T04:17:00Z</timestamp>
>       <contributor>
>         <username>Rory096</username>
>         <id>750223</id>
>       </contributor>
>       <comment>cat rd</comment>
>       <text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R
> from CamelCase}}</text>
>       <sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
>       <model>wikitext</model>
>       <format>text/x-wiki</format>
>     </revision>
>   </page>
>   <page>
>     <title>AmoeboidTaxa</title>
>     <ns>0</ns>
>     <id>24</id>
>     <redirect title="Amoeboid" />
>     <revision>
>       <id>74466889</id>
>       <parentid>15898958</parentid>
>       <timestamp>2013-09-08T04:17:51Z</timestamp>
>       <contributor>
>         <username>Rory096</username>
>         <id>750223</id>
>       </contributor>
>       <comment>cat rd</comment>
>       <text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from
> CamelCase}}</text>
>       <sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
>       <model>wikitext</model>
>       <format>text/x-wiki</format>
>     </revision>
>   </page>
> "
> I will process this and remove everything between <page>..</page> for the
> page whose title is "AssistiveTechnology", because its timestamp does not
> have "2013" as the year. Then I will pass the entire page-article XML on
> for parsing.
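>
> (Roughly, the filtering step I have in mind would look like this in
> Python, using xml.etree's iterparse; this is only a sketch for a fragment
> like the one above. A real dump puts the elements in a MediaWiki XML
> namespace, which the endswith() checks paper over, and the output would
> still need the <siteinfo>/namespace header that the extractor expects.)
>
> import sys
> import xml.etree.ElementTree as ET
>
> WANTED_YEAR = "2013"
>
> def keep_page(page_elem):
>     # Keep a <page> only if its revision timestamp starts with the wanted year.
>     for elem in page_elem.iter():
>         if elem.tag.endswith("timestamp"):
>             return elem.text.startswith(WANTED_YEAR)
>     return False
>
> def filter_dump(in_path, out_path):
>     with open(out_path, "w", encoding="utf-8") as out:
>         out.write("<mediawiki>\n")
>         for _, elem in ET.iterparse(in_path):
>             if elem.tag.endswith("page"):
>                 if keep_page(elem):
>                     out.write(ET.tostring(elem, encoding="unicode"))
>                 elem.clear()  # free memory on large dumps
>         out.write("</mediawiki>\n")
>
> if __name__ == "__main__":
>     filter_dump(sys.argv[1], sys.argv[2])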
>
>
>
> On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]>wrote:
>
>>  Hi Gaurav,
>> I don't know your exact use case, but here's what we do. There is an IRC
>> channel where Wikipedia continuously lists pages as and when they change.
>> We listen to the IRC channel and every hour make a list of the unique
>> pages that changed. Wikipedia's MediaWiki software gives you an API to
>> download these pages in bulk; it looks like this:
>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=
>>
>> You can download these pages and put them in the same format as the full
>> dump download by appending the Wikipedia namespace list (you can get the
>> list from
>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content).
>> Thereafter you can put the file in the same location as the full dump and
>> invoke the extraction code. It works as expected.
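>>
>> (A rough Python sketch of the hourly bulk-download step described above:
>> fetch the unique titles collected from the IRC feed through the export
>> API in batches. The 50-title batch size is, as far as I remember, the
>> usual api.php limit for non-bot accounts, and stitching the returned
>> <mediawiki> documents into one dump file with the namespace header is
>> left out here.)
>>
>> from urllib.parse import quote
>> from urllib.request import urlopen
>>
>> API = ("http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap"
>>        "&prop=revisions&rvprop=timestamp|content&titles=")
>>
>> def download_changed(titles, batch_size=50):
>>     # Returns one exported XML document per batch of titles.
>>     exports = []
>>     for i in range(0, len(titles), batch_size):
>>         batch = "|".join(titles[i:i + batch_size])
>>         exports.append(urlopen(API + quote(batch)).read().decode("utf-8"))
>>     return exports
>>
>> for xml_chunk in download_changed(["Assistive technology", "Amoeboid"]):
>>     print(xml_chunk[:200])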
>>
>>
>>  Regards
>> Amit
>>
>>
>>
>>   From: gaurav pant <[email protected]>
>> Date: Thursday, March 7, 2013 12:17 PM
>> To: Dimitris Kontokostas <[email protected]>, "
>> [email protected]" <
>> [email protected]>
>>
>> Subject: Re: [Dbpedia-discussion] page article has last modified
>> timestamp
>>
>>  Hi All,
>>
>> Thanks Dimitris for your help.
>>
>> I would also like one more confirmation from you.
>>
>> I just went through the code of InfoboxExtractor. It seems to me that the
>> code is written to process the data page by page (<page>..</page>). If I
>> remove all the unchanged pages from the "page-article" dump using some
>> Perl/Python script and then apply infobox extraction or abstract
>> extraction, then we will get only the updated triples as output, like
>> DBpedia Live for English.
>>
>> Please correct me if I am wrong.
>>
>> Thanks
>>
>
>
>
> --
> Regards
> Gaurav Pant
> +91-7709196607,+91-9405757794
>
>


-- 
Kontokostas Dimitris
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
