Regarding scraping Wikipedia HTML pages: It's a different type of extraction, but
all the relevant structured data is there (e.g. infobox template name,
attribute names and values, categories, etc.), and all wiki templates have
already been interpreted into user-friendly plain-text values, so you don't
have to interpret them yourself.
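As an illustration of the scraping approach described above, here is a minimal sketch that pulls label/value pairs out of a rendered infobox using only the Python standard library. It assumes the infobox is a table whose class attribute contains "infobox" and that each attribute is a row with one th (label) and one td (value). The class name InfoboxParser, the helper extract_infobox, and the sample markup are hypothetical and hand-written for this example; real pages are messier.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect label -> value pairs from a table whose class contains 'infobox'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # table nesting depth inside the infobox (0 = outside)
        self.cell = None    # "th" or "td" while inside a top-level infobox cell
        self.label = []
        self.value = []
        self.rows = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            # Enter the infobox, or track tables nested inside it.
            if self.depth or "infobox" in classes:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.label, self.value = [], []
        elif self.depth == 1 and tag in ("th", "td"):
            self.cell = tag

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
        elif self.depth == 1 and tag in ("th", "td"):
            self.cell = None
        elif self.depth == 1 and tag == "tr":
            # Normalise whitespace and keep only rows that have both parts.
            label = " ".join("".join(self.label).split())
            value = " ".join("".join(self.value).split())
            if label and value:
                self.rows[label] = value

    def handle_data(self, data):
        if self.depth == 1 and self.cell == "th":
            self.label.append(data)
        elif self.depth == 1 and self.cell == "td":
            self.value.append(data)


def extract_infobox(html):
    parser = InfoboxParser()
    parser.feed(html)
    return parser.rows


# Hand-written sample markup, not copied from a real page.
SAMPLE = """
<table class="infobox vevent">
  <tr><th colspan="2">Actrius</th></tr>
  <tr><th>Release date</th><td><ul><li>1996</li></ul></td></tr>
</table>
"""

print(extract_infobox(SAMPLE))  # {'Release date': '1996'}
```

The value arrives as plain text because MediaWiki has already expanded the template, which is the point made above. Text inside tables nested within the infobox is ignored by this sketch.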
Regarding delegating to MediaWiki: Wikipedia's MediaWiki Special API has a 1 QPS
(query per second) rule, but you can request more than one page per call, and you
only need to do it for pages that have been added or modified. So, depending on
your update frequency, it may be doable. Otherwise you will have to install and
maintain your own MediaWiki cluster as a Wikipedia mirror, so you can hit it as
hard as you need.
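A rough sketch of the batching idea above: group the titles of added or modified pages into batches, build one action=query URL per batch, and pace the calls at about one per second. The batch size of 50 titles, the choice of prop=revisions (which returns wikitext, not rendered HTML), and the function names batch_urls and throttled are assumptions made for illustration. No request is actually sent here.

```python
import time
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request title limit for ordinary clients


def batch_urls(titles, batch_size=BATCH_SIZE):
    """Yield one action=query URL per batch of page titles."""
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        params = {
            "action": "query",
            "prop": "revisions",   # assumption: fetch the latest revision text
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "titles": "|".join(batch),  # several pages in a single call
        }
        yield API_URL + "?" + urlencode(params)


def throttled(items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per min_interval seconds."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


if __name__ == "__main__":
    changed = ["Actrius", "April Love (film)"]  # pages added or modified
    for url in throttled(batch_urls(changed)):
        print(url)  # a real client would fetch the URL here
```

With this shape, a daily update of N changed pages costs roughly N / 50 requests, so whether the API route is workable depends mostly on how many pages change between runs.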
Nicolas.



     On Wednesday, February 18, 2015 3:45 PM, Mandar Rahurkar 
<rahur...@gmail.com> wrote:
   

Thanks Nicolas! :)
1. Scraping rendered Wikipedia HTML pages seems like it would be noisy in terms
of data quality. Isn't that so?
2. If we delegate to the MediaWiki API, is this option scalable if we had to
parse the wikidump on a daily basis?
thanks,
Mandar
On Wed, Feb 18, 2015 at 1:11 PM, Nicolas Torzec <torz...@yahoo-inc.com> wrote:

Hi Mandar :)
DBpedia does not handle nested templates. It may work for some specific 
(simple-enough) templates but it is in no way generalized.

That's why consumer-grade projects consuming Wikipedia data either:
1) Scrape Wikipedia HTML pages directly: i.e. template interpretation is done by
MediaWiki, on wikipedia.org or on dedicated Wikipedia mirrors.
2) Set up their own Wikipedia extraction framework, which may interpret
templates directly or delegate to MediaWiki using its API.
Nicolas.




 

     On Wednesday, February 18, 2015 10:56 AM, Mandar Rahurkar 
<rahur...@gmail.com> wrote:
   

Thanks guys for your comments! Release date information for April Love
(film) is available:
http://dbpedia.org/page/April_Love_(film)

but not for http://dbpedia.org/page/Actrius
And if you examine the Wikipedia pages, they both seem to use a nested
template:
http://en.wikipedia.org/w/index.php?title=April_Love_(film)&action=edit

So maybe this is more than one issue?
thanks,
Mandar



On Wed, Feb 18, 2015 at 9:22 AM, Alexandru Todor <to...@inf.fu-berlin.de> wrote:

Hi Vladimir, Mandar,
The mappings extractor can't handle nested templates:
http://sourceforge.net/p/dbpedia/mailman/message/32867924/
@Dimitris: I know this is on your to-do list, any progress so far?
Cheers,
Alexandru
On Wed, Feb 18, 2015 at 5:49 PM, Vladimir Alexiev 
<vladimir.alex...@ontotext.com> wrote:

Hi Mandar!

Run these queries on http://yasgui.org/, selecting http://dbpedia.org/sparql as 
endpoint.

First check the raw property dbp:released:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select * {?x a dbo:Film; dbp:released ?rel
    filter exists {?x rdfs:label ?lab filter(strstarts(?lab,"Act"))}}
order by ?x limit 100

As you can see many movies have it, but not Actrius.
So Volha is right, the problem is that in that movie it's not a plain date.
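For checking a single film without the YASGUI front end, a sketch along these lines may help. It builds a SPARQL-protocol GET request against the same endpoint, asking for the dbp:released values of one resource. The helper names released_query, build_request and run are hypothetical, and the resource URI pattern http://dbpedia.org/resource/NAME is inferred from the page URLs quoted in this thread. Only run() touches the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://dbpedia.org/sparql"


def released_query(resource):
    """SPARQL asking for the dbp:released values of one DBpedia resource."""
    return (
        "PREFIX dbp: <http://dbpedia.org/property/>\n"
        f"SELECT ?rel WHERE {{ <http://dbpedia.org/resource/{resource}> "
        "dbp:released ?rel }"
    )


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL-protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON results format via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run(query):
    """Send the query and return the bound ?rel values (needs network access)."""
    with urlopen(build_request(query), timeout=30) as resp:
        data = json.load(resp)
    return [b["rel"]["value"] for b in data["results"]["bindings"]]


if __name__ == "__main__":
    for name in ("April_Love_(film)", "Actrius"):
        print(name, run(released_query(name)))
```

An empty list for a film would correspond to the missing value discussed here. What the endpoint returns today may differ from what it returned when this thread was written.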

> How were you able to extract that information?

It's in https://en.wikipedia.org/w/index.php?title=Actrius&action=edit:
   | release = {{Film date|1996|||}}

I tried to make a mapping: 
http://mappings.dbpedia.org/index.php/Mapping_en:Film_date
to extract release year and location (there can be several).
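To make the intended extraction concrete, here is a small sketch of pulling release year and location out of a Film date call in raw wikitext. It is not how the DBpedia extraction framework works. It assumes the template's positional arguments come in repeating groups of year, month, day, location, and that named arguments (anything containing "=") can be ignored. The function name parse_film_dates and the second sample string are made up for illustration.

```python
import re

# Matches {{Film date|...}} calls whose arguments contain no nested templates.
FILM_DATE = re.compile(r"\{\{\s*Film date\s*\|([^{}]*)\}\}", re.IGNORECASE)


def _to_int(text):
    """Return int(text) for purely numeric text, else None."""
    return int(text) if text.isdigit() else None


def parse_film_dates(wikitext):
    """Return (year, month, day, location) tuples from {{Film date|...}} calls."""
    releases = []
    for match in FILM_DATE.finditer(wikitext):
        # Keep positional arguments only; named ones are skipped.
        args = [a.strip() for a in match.group(1).split("|") if "=" not in a]
        # Assumed layout: repeating groups of year | month | day | location.
        for i in range(0, len(args), 4):
            year, month, day, location = (args[i:i + 4] + [""] * 4)[:4]
            if _to_int(year) is None:
                continue  # a group without a numeric year carries no release
            releases.append(
                (_to_int(year), _to_int(month), _to_int(day), location or None)
            )
    return releases


print(parse_film_dates("| release = {{Film date|1996|||}}"))
# [(1996, None, None, None)]

# Made-up example with two releases in one template call.
print(parse_film_dates("{{Film date|2001|2|3|Country A|2002|||Country B}}"))
# [(2001, 2, 3, 'Country A'), (2002, None, None, 'Country B')]
```

The second call shows why one template can yield several release dates, which is the case raised above.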

But it doesn't extract anything. Maybe templates INSIDE template fields are not 
used for extraction?
Issue: https://github.com/dbpedia/mappings-tracker/issues/46
Test cases: http://mappings.dbpedia.org/index.php/Mapping_en_talk:Film_date

If that's the case, we could map it to another date template here:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/config/dataparser/DateTimeParserConfig.scala#L97
But Volha, can it extract SEVERAL dates from one template?



------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion







    



   
