Hi Dimitris,
Actually, in my previous mail I wanted to know whether there is any API with
which I can get a dump of the updated page-article pages, so that from those
I can generate live dump files (nt, ttl) for languages other than English.
On Thu, Mar 7, 2013 at 1:57 PM, Dimitris Kontokostas <[email protected]> wrote:
> Hi all,
>
> This is exactly what DBpedia Live is doing. We have "Feeders" that place
> articles in a process queue, extract triples out of them and then update a
> triple store.
> For now we mainly use OAI-PMH for our feeds, but we could easily add a new
> IRC-Feeder for Amit's needs.
>
> It depends on what you want: if you want dump files (nt, ttl), then feeding
> custom XML dumps to the dump module will be simpler; if, on the other hand,
> you want an up-to-date triple store, then the live module will suit your
> needs better.
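>
> To give a rough idea of the shape of that pipeline, here is a minimal
> hypothetical Python sketch (not the actual DBpedia Live code): feeders push
> changed page titles onto a queue, and a worker pulls them off, extracts
> triples and updates a store.
>
> import queue
> import threading
>
> work_queue = queue.Queue()
> triple_store = []  # stand-in for a real triple store
>
> def extract_triples(title):
>     # placeholder for the real per-page extraction step
>     resource = "dbpedia:" + title.replace(" ", "_")
>     return [(resource, "dbo:wikiPageModified", "now")]
>
> def worker():
>     while True:
>         title = work_queue.get()
>         triple_store.extend(extract_triples(title))
>         work_queue.task_done()
>
> threading.Thread(target=worker, daemon=True).start()
>
> # a feeder (OAI-PMH, IRC, ...) would push changed titles like this:
> for title in ["Assistive technology", "Amoeboid"]:
>     work_queue.put(title)
> work_queue.join()
> print(len(triple_store), "triples in the store")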
>
> Best,
> Dimitris
>
>
> On Thu, Mar 7, 2013 at 9:34 AM, gaurav pant <[email protected]> wrote:
>
>> Hi All/Amit,
>>
>> @Amit - Thanks for shedding light on new things; so there is also an API
>> to get updated pages in bulk.
>>
>> My requirement is to get all the pages that have been updated within a
>> certain interval. It is also OK if I can get the updated page dump on a
>> daily basis. Then I will process these page dumps using the
>> dbpedia-extractor to extract the infobox and abstract, so that I have
>> current infobox and abstract data.
>>
>>
>> As you have mentioned, "Wikipedia's mediawiki software gives you an api to
>> download these pages in bulk". Then I can call that API to get a dump of
>> the latest updated pages.
>>
>> I searched for the same, but it seems it only supports English and German.
>> However, I want the API for all the languages (about 6-7 languages).
>>
>> Please help with this and let me know if there is any such API with which
>> I can get an updated page dump for the different languages.
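>>
>> For example, something like the following is what I hope works for any
>> language edition (an untested sketch; it assumes only that each
>> <lang>.wikipedia.org exposes the same api.php with the standard
>> recentchanges and export parameters):
>>
>> import requests
>>
>> LANG = "fr"  # e.g. "de", "it", "nl", ...
>> API = "https://" + LANG + ".wikipedia.org/w/api.php"
>>
>> # 1. list pages changed recently (main namespace only)
>> changes = requests.get(API, params={
>>     "action": "query",
>>     "list": "recentchanges",
>>     "rcnamespace": 0,
>>     "rcprop": "title|timestamp",
>>     "rclimit": 50,
>>     "format": "json",
>> }).json()
>> titles = {rc["title"] for rc in changes["query"]["recentchanges"]}
>>
>> # 2. export those pages as page-article XML, same format as the dumps
>> xml_dump = requests.get(API, params={
>>     "action": "query",
>>     "export": 1,
>>     "exportnowrap": 1,
>>     "prop": "revisions",
>>     "rvprop": "timestamp|content",
>>     "titles": "|".join(titles),
>> }).text
>> print(xml_dump[:500])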
>>
>> Otherwise I am going to use my approach.
>>
>> Suppose I have an XML dump like this:
>> "<page>
>> <title>AssistiveTechnology</title>
>> <ns>0</ns>
>> <id>23</id>
>> <redirect title="Assistive technology" />
>> <revision>
>> <id>74466798</id>
>> <parentid>15898957</parentid>
>> *<timestamp>2006-09-08T04:17:00Z</timestamp>*
>> <contributor>
>> <username>Rory096</username>
>> <id>750223</id>
>> </contributor>
>> <comment>cat rd</comment>
>> <text xml:space="preserve">#REDIRECT [[Assistive_technology]] {{R
>> from CamelCase}}</text>
>> <sha1>izyyjg1zanv4ett6ox75bxxq88owztd</sha1>
>> <model>wikitext</model>
>> <format>text/x-wiki</format>
>> </revision>
>> </page>
>> <page>
>> <title>AmoeboidTaxa</title>
>> <ns>0</ns>
>> <id>24</id>
>> <redirect title="Amoeboid" />
>> <revision>
>> <id>74466889</id>
>> <parentid>15898958</parentid>
>> * <timestamp>2013-09-08T04:17:51Z</timestamp>*
>> <contributor>
>> <username>Rory096</username>
>> <id>750223</id>
>> </contributor>
>> <comment>cat rd</comment>
>> <text xml:space="preserve">#REDIRECT [[Amoeboid]] {{R from
>> CamelCase}}</text>
>> <sha1>k84gqaug0izzy1ber1dq8bogr2a6txa</sha1>
>> <model>wikitext</model>
>> <format>text/x-wiki</format>
>> </revision>
>> </page>
>> "
>> I will process this and remove everything between <page>..</page> for the
>> page whose title is "AssistiveTechnology", because its timestamp does not
>> have "2013" as the year. Then I will give the entire remaining page-article
>> XML to the extractor to parse.
>>
>>
>>
>> On Thu, Mar 7, 2013 at 12:25 PM, Amit Kumar <[email protected]> wrote:
>>
>>> Hi Gaurav,
>>> I don't know your exact use case, but here's what we do. There is an IRC
>>> channel where Wikipedia continuously lists pages as and when they change.
>>> We listen to that IRC channel and every hour make a list of the unique
>>> pages that changed. Wikipedia's MediaWiki software gives you an API to
>>> download these pages in bulk; it looks like this "
>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content&titles=
>>> "
>>>
>>> You can download these pages and put them in the same format as the full
>>> dump download by appending the Wikipedia namespace list (you can get the
>>> list from "
>>> http://en.wikipedia.org/w/api.php?action=query&export&exportnowrap&prop=revisions&rvprop=timestamp|content ").
>>> Thereafter you can put the file in the same location as the full dump and
>>> invoke the extraction code. It works as expected.
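>>>
>>> The hourly download step could look roughly like this (a hypothetical
>>> helper, assuming the titles were already collected from the IRC feed as
>>> above; note that each response is a complete <mediawiki> export document,
>>> so the pieces still have to be merged into one dump before extraction):
>>>
>>> import requests
>>>
>>> API = "http://en.wikipedia.org/w/api.php"
>>>
>>> def export_pages(titles, out_path, batch_size=50):
>>>     """Fetch the changed pages in batches and write one dump-like file."""
>>>     with open(out_path, "w", encoding="utf-8") as out:
>>>         for i in range(0, len(titles), batch_size):
>>>             batch = titles[i:i + batch_size]
>>>             xml = requests.get(API, params={
>>>                 "action": "query",
>>>                 "export": 1,
>>>                 "exportnowrap": 1,
>>>                 "prop": "revisions",
>>>                 "rvprop": "timestamp|content",
>>>                 "titles": "|".join(batch),
>>>             }).text
>>>             out.write(xml)
>>>
>>> # export_pages(sorted(changed), "hourly-pages-articles.xml")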
>>>
>>>
>>> Regards
>>> Amit
>>>
>>>
>>>
>>> From: gaurav pant <[email protected]>
>>> Date: Thursday, March 7, 2013 12:17 PM
>>> To: Dimitris Kontokostas <[email protected]>,
>>>     "[email protected]" <[email protected]>
>>>
>>> Subject: Re: [Dbpedia-discussion] page article has last modified
>>> timestamp
>>>
>>> Hi All,
>>>
>>> Thanks, Dimitris, for your help.
>>>
>>> I also want one more confirmation from you.
>>>
>>> I just went through the code of the InfoboxExtractor. There it seems to
>>> me that the code is written to process the data page by page
>>> (<page>..</page>). If I remove all those pages from the "page-article"
>>> dump using some perl/python script and then apply the infobox extraction
>>> or abstract extraction, then we will get only the updated triples as
>>> output, like DBpedia Live for English.
>>>
>>> Please correct me if I am wrong.
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Regards
>> Gaurav Pant
>> +91-7709196607,+91-9405757794
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>
--
Regards
Gaurav Pant
+91-7709196607,+91-9405757794
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion