You could use Amit's suggestion for getting the list of modified articles and
then the methods in this class:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/sources/WikiSource.scala
to fetch a fresh copy of them from Wikipedia.
Otherwise, you can download the monthly dumps and remove the unmodified
articles, as you suggested.
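
For example, a rough sketch of that combination (the exact signatures of
WikiSource.fromTitles, WikiTitle.parse and Language may differ between
framework versions, so treat this as an illustration rather than tested code):

import java.net.URL
import org.dbpedia.extraction.sources.WikiSource
import org.dbpedia.extraction.util.Language
import org.dbpedia.extraction.wikiparser.WikiTitle

object FetchModifiedPages {
  def main(args: Array[String]): Unit = {
    // Example titles only; in practice this list would come from the feed of modified articles
    val language = Language("en")
    val titles = List("Assistive technology", "Amoeboid").map(t => WikiTitle.parse(t, language))

    // WikiSource builds a Source that downloads the current revision of each title from api.php
    val source = WikiSource.fromTitles(titles, new URL("http://en.wikipedia.org/w/api.php"), language)

    // Each WikiPage can then be fed to the extractors (or written back out as XML)
    for (page <- source) println(page.title)
  }
}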
On Thu, Mar 7, 2013 at 10:47 AM, gaurav pant <[email protected]> wrote:
> Hi Dimitris,
>
> Actually, in my previous mail I wanted to know whether there is any such
> API with which I can get a dump of the updated page-article pages, so that
> from them I can generate a live dump file (nt, ttl) for languages other
> than English.
>
>
> On Thu, Mar 7, 2013 at 1:57 PM, Dimitris Kontokostas <[email protected]> wrote:
>
>> Hi all,
>>
>> This is exactly what DBpedia Live is doing. We have "Feeders" that place
>> articles in a processing queue, extract triples from them and then update
>> a triple store.
>> For now we mainly use OAI-PMH for our feeds, but we could easily add a new
>> IRC feeder for Amit's needs.
>>
>> It depends on what you want: if you want dump files (nt, ttl), then
>> feeding custom XML dumps to the dump module will be simpler; if, on the
>> other hand, you want an up-to-date triple store, then the live module will
>> suit your needs better.
>>
>> Best,
>> Dimitris
>>
>>
>> On Thu, Mar 7, 2013 at 9:34 AM, gaurav pant <[email protected]> wrote:
>>
>>> Hi All/Amit,
>>>
>>> @Amit - Thanks for shedding light on new things; so there is also an API
>>> to get updated pages in bulk.
>>>
>>> My requirement is to get all those pages that were updated within a
>>> certain interval. It is also OK if I can get the updated page dump on a
>>> daily basis. Then I will process these page dumps with the DBpedia
>>> extractor to extract the infoboxes and abstracts, so that I have current
>>> infobox and abstract data.
>>>
>>>
>>> As you mentioned, "Wikipedia's mediawiki software gives you an api to
>>> download these pages in bulk". Then I can use that API to get a dump of
>>> the latest updated pages.
>>>
>>> I searched for the same, but it seems it only supports English and
>>> German. However, I want an API for all the languages I work with (about
>>> 6-7 languages).
>>>
>>> Please help me with this and let me know if there is any such API with
>>> which I can get an updated page dump for different languages.
>>>
>>> Otherwise I am going to use my approach.
>>>
>>> Suppose I have this XML dump:
>>> "<page>
>>>   <title>AssistiveTechnology</title>
>>>   <ns>0</ns>
>>>   <id>23</id>
>>>   <redirect title="Assistive technology" />
>>>   <revision>
>>>     <id>74466798</id>
>>>     <parentid>15898957</parentid>
>>>     *<timestamp>2006-09-08T04:17:00Z</timestamp>*
>>>     <contributor>
>>>       <username>Rory096</username>
>>>       <id>750223</id>
>>>     </contributor>
>>>     <comment>cat rd</comment>
>>>     <text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R from CamelCase}}</text>
>>>     <sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
>>>     <model>wikitext</model>
>>>     <format>text/x-wiki</format>
>>>   </revision>
>>> </page>
>>> <page>
>>>   <title>AmoeboidTaxa</title>
>>>   <ns>0</ns>
>>>   <id>24</id>
>>>   <redirect title="Amoeboid" />
>>>   <revision>
>>>     <id>74466889</id>
>>>     <parentid>15898958</parentid>
>>>     *<timestamp>2013-09-08T04:17:51Z</timestamp>*
>>>     <contributor>
>>>       <username>Rory096</username>
>>>       <id>750223</id>
>>>     </contributor>
>>>     <comment>cat rd</comment>
>>>     <text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from CamelCase}}</text>
>>>     <sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
>>>     <model>wikitext</model>
>>>     <format>text/x-wiki</format>
>>>   </revision>
>>> </page>
>>> "
>>> I will process this and remove everything between <page>..</page> for the
>>> page whose title is "AssistiveTechnology", because its timestamp does not
>>> have "2013" as the year. Then I will pass the entire page-article XML to
>>> the parser.
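>>>
>>> A rough sketch of that filtering step in Scala (the file names here are
>>> made up, and scala.xml loads the whole file into memory, so a real
>>> pages-articles dump would need a streaming parser instead):
>>>
>>> import scala.xml.XML
>>>
>>> object FilterDumpByYear {
>>>   def main(args: Array[String]): Unit = {
>>>     val dump = XML.loadFile("pages-articles.xml")
>>>
>>>     // keep only pages whose revision timestamp has 2013 as the year
>>>     val recentPages = (dump \ "page").filter { page =>
>>>       (page \ "revision" \ "timestamp").text.startsWith("2013")
>>>     }
>>>
>>>     // re-wrap the kept pages together with the original <siteinfo>
>>>     // (namespace list) so the extractor still sees a dump-like structure;
>>>     // the attributes of the original <mediawiki> root are dropped here
>>>     val reduced = <mediawiki>{ dump \ "siteinfo" }{ recentPages }</mediawiki>
>>>     XML.save("pages-articles-2013.xml", reduced, "UTF-8", xmlDecl = true)
>>>   }
>>> }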
>>>
>>>
>>>
>>> On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]> wrote:
>>>
>>>> Hi Gaurav,
>>>> I don't know your exact use case, but here's what we do. There is an IRC
>>>> channel where Wikipedia continuously lists pages as and when they change.
>>>> We listen to the IRC channel and every hour make a list of the unique
>>>> pages that changed. Wikipedia's MediaWiki software gives you an API to
>>>> download these pages in bulk; it looks like this:
>>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=
>>>>
>>>> You can download these pages and put them in the same format as the full
>>>> dump download by appending the Wikipedia namespace list (you can get the
>>>> list from
>>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content).
>>>> Thereafter you can put the file in the same location as the full dump and
>>>> invoke the extraction code. It works as expected.
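>>>>
>>>> A minimal sketch of that bulk-download step (the title list and output
>>>> file name are just examples, and the API accepts at most 50 titles per
>>>> request, so a longer hourly list has to be split into batches):
>>>>
>>>> import java.io.{File, PrintWriter}
>>>> import java.net.URLEncoder
>>>>
>>>> object FetchChangedPages {
>>>>   def main(args: Array[String]): Unit = {
>>>>     // hourly list of unique changed titles, e.g. collected from the IRC feed
>>>>     val changedTitles = List("Assistive technology", "Amoeboid")
>>>>
>>>>     // build the export URL shown above, with the titles joined by "|"
>>>>     val url = "http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap" +
>>>>       "&prop=revisions&rvprop=" + URLEncoder.encode("timestamp|content", "UTF-8") +
>>>>       "&titles=" + URLEncoder.encode(changedTitles.mkString("|"), "UTF-8")
>>>>
>>>>     // the response is <page> XML in the same format as the dump files
>>>>     val xml = scala.io.Source.fromURL(url, "UTF-8").mkString
>>>>
>>>>     val out = new PrintWriter(new File("changed-pages.xml"), "UTF-8")
>>>>     try out.write(xml) finally out.close()
>>>>   }
>>>> }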
>>>>
>>>>
>>>> Regards
>>>> Amit
>>>>
>>>>
>>>>
>>>> From: gaurav pant <[email protected]>
>>>> Date: Thursday, March 7, 2013 12:17 PM
>>>> To: Dimitris Kontokostas <[email protected]>, "[email protected]" <[email protected]>
>>>>
>>>> Subject: Re: [Dbpedia-discussion] page article has last modified
>>>> timestamp
>>>>
>>>> Hi All,
>>>>
>>>> Thanks, Dimitris, for your help.
>>>>
>>>> I also want one more confirmation from you.
>>>>
>>>> I just went through the code of the InfoboxExtractor. It seems to me that
>>>> the code is written to process data page by page (<page>..</page>). If I
>>>> remove all the unchanged pages from the "page-article" dump with some
>>>> perl/python script and then apply the infobox extraction or abstract
>>>> extraction, we will get only the updated triples as output, like DBpedia
>>>> Live for English.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Regards
>>> Gaurav Pant
>>> +91-7709196607,+91-9405757794
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
>
> --
> Regards
> Gaurav Pant
> +91-7709196607,+91-9405757794
>
--
Kontokostas Dimitris
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion