Hi Gaurav,

The simplest way to filter out unmodified pages is probably to add a
filter in ExtractionJob.scala [1]. We don't have configurable filters
yet, so you will have to modify the source code. Basically, you have
to change this line:

    if (namespaces.contains(page.title.namespace)) {

to something like

    if (page.timestamp >= minimalTimestamp &&
        namespaces.contains(page.title.namespace)) {

You will also have to add the boilerplate code that reads
minimalTimestamp from the config file and passes it on to
ExtractionJob.
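
To make the idea a bit more concrete, here is a rough, self-contained
sketch of the two pieces: reading a minimal timestamp from a config and
applying the combined condition. Everything in it is an assumption for
illustration only: the Page case class is a stand-in and not the
framework's page type, the property name "min-timestamp" is made up, and
I am assuming the timestamp is compared as epoch milliseconds. The real
change would go into ExtractionJob and the existing config handling.

```scala
import java.time.Instant
import java.util.Properties

// Hypothetical stand-in for the framework's page type; only the fields
// needed for the filter are modelled here.
final case class Page(title: String, namespace: Int, timestamp: Long)

object TimestampFilter {
  // Reads the (made-up) "min-timestamp" property, given in the same
  // ISO-8601 form as the dump's <timestamp> element, and converts it to
  // epoch milliseconds. A missing property means "no lower bound".
  def minimalTimestamp(config: Properties): Long =
    Option(config.getProperty("min-timestamp"))
      .map(s => Instant.parse(s.trim).toEpochMilli)
      .getOrElse(Long.MinValue)

  // Same shape as the condition suggested above: keep a page only if it
  // was modified at or after the threshold and is in a wanted namespace.
  def accept(page: Page, namespaces: Set[Int], minimalTimestamp: Long): Boolean =
    page.timestamp >= minimalTimestamp && namespaces.contains(page.namespace)
}
```

With "min-timestamp" set to 2013-02-01T00:00:00Z, a main-namespace page
whose last revision is dated 2013-02-14T21:00:17Z would be kept, and a
page last edited in 2012 would be skipped.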
Cheers,
JC
[1]
https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob.scala
On Thu, Mar 7, 2013 at 7:47 AM, gaurav pant <[email protected]> wrote:
> Hi All,
>
> Thanks, Dimitris, for your help.
>
> I would also like one more confirmation from you.
>
> I have just gone through the code of InfoboxExtractor. It seems to me that
> the code processes the data page by page (<page>..</page>). If I remove
> all the unmodified pages from the "pages-articles" dump using some Perl/Python
> script and then apply Infobox extraction or Abstract extraction, we will
> get only the updated triples as output, like DBpedia Live for English.
>
>
> Please correct me if I am wrong.
>
> Thanks
>
>
> On Wed, Mar 6, 2013 at 5:51 PM, Dimitris Kontokostas <[email protected]>
> wrote:
>>
>> Hi Gaurav,
>>
>> You are correct!
>> Cheers,
>> Dimitris
>>
>>
>> On Wed, Mar 6, 2013 at 2:05 PM, gaurav pant <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Greetings for the day!
>>>
>>> I have extracted the lines below from one of the "pages-articles" files
>>> available at
>>> "http://en.wikipedia.org/wiki/Wikipedia:Database_download#Other_languages".
>>> If I am not mistaken, the line marked in red below is the last-modified
>>> timestamp of the page. Please correct me if I am wrong.
>>>
>>> "<page>
>>> <title>Alan Smithee</title>
>>> <ns>0</ns>
>>> <id>1</id>
>>> <revision>
>>> <id>114215698</id>
>>> <parentid>114215658</parentid>
>>> <timestamp>2013-02-14T21:00:17Z</timestamp>
>>> <contributor>
>>> <ip>2003:58:A507:6A01:1C37:DB74:A237:E121</ip>
>>> </contributor>
>>> <comment>/* Entstehung */</comment>
>>> "
>>>
>>> --
>>> Regards
>>> Gaurav Pant
>>> +91-7709196607,+91-9405757794
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
>>> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
>>> endpoint security space. For insight on selecting the right partner to
>>> tackle endpoint security challenges, access the full report.
>>> http://p.sf.net/sfu/symantec-dev2dev
>>> _______________________________________________
>>> Dbpedia-discussion mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>
>>
>>
>>
>> --
>> Kontokostas Dimitris
>
>
>
>
> --
> Regards
> Gaurav Pant
> +91-7709196607,+91-9405757794