Regarding scraping Wikipedia HTML pages: It's a different type of extraction but
all the relevant structured data is there (e.g. infobox template name,
attribute names and values, categories, etc.) and all Wiki templates have
already been interpreted into user-friendly plain-text values, so you don't
have to.
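As a rough illustration of that kind of extraction, here is a minimal Python sketch using only the standard library. The SAMPLE markup is a hand-written fragment, not the real page source, and it assumes infobox rows are rendered as th/td pairs inside a table whose class list contains "infobox". The names InfoboxParser and extract_infobox are made up for this example.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect label/value pairs from tables whose class list contains 'infobox'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # table nesting depth inside an infobox (0 = outside)
        self.cell = None  # "th" or "td" while inside a cell, else None
        self.label = []
        self.value = []
        self.rows = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            # Enter an infobox, or track a table nested inside one.
            if self.depth or "infobox" in classes:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.label, self.value = [], []
        elif self.depth and tag in ("th", "td"):
            self.cell = tag

    def handle_data(self, data):
        if self.cell == "th":
            self.label.append(data)
        elif self.cell == "td":
            self.value.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.depth:
            # Collapse runs of whitespace left over from the markup.
            label = " ".join("".join(self.label).split())
            value = " ".join("".join(self.value).split())
            if label and value:
                self.rows[label] = value
        elif tag == "table" and self.depth:
            self.depth -= 1


def extract_infobox(html):
    parser = InfoboxParser()
    parser.feed(html)
    return parser.rows


# Hand-written fragment for illustration only (an assumption, not real page markup).
SAMPLE = """
<table class="infobox vevent">
  <tr><th colspan="2">Actrius</th></tr>
  <tr><th>Country</th><td><a href="/wiki/Spain">Spain</a></td></tr>
  <tr><th>Release date</th><td><div class="plainlist"><ul><li>1996</li></ul></div></td></tr>
</table>
"""

print(extract_infobox(SAMPLE))
```

The nested template is already rendered to plain text in the HTML, so the parser only has to flatten the cell contents. Real pages are messier (footnote markers, line breaks, multi-value lists), so treat this as a starting point only.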
Regarding delegating to MediaWiki: Wikipedia's MediaWiki Special API has a 1 QPS
rule, but you can request more than 1 page per call and you only need to do it
for pages that have been added or modified. So, depending on your update
frequency, it may be doable. Otherwise you will have to install and maintain
your own MediaWiki cluster as a Wikipedia mirror, so you can hit it as hard as
you need.
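To make the batching idea concrete, here is a small Python sketch. It assumes the standard action=query endpoint with pipe-separated titles, a limit of 50 titles per request for regular clients, and a one-second pause between calls to respect the 1 QPS rule mentioned above. The helper names (batches, build_url, fetch_changed) are hypothetical, and the actual HTTP call is passed in as a function so that the pacing logic can be tried without touching the network.

```python
import time
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
MAX_TITLES = 50  # assumed per-request title limit for regular (non-bot) clients


def batches(titles, size=MAX_TITLES):
    """Yield consecutive slices of at most `size` titles."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]


def build_url(batch):
    """Build one query URL that asks for the wikitext of every title in the batch."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "titles": "|".join(batch),
    }
    return API_URL + "?" + urlencode(params)


def fetch_changed(titles, fetch, min_interval=1.0, sleep=time.sleep):
    """Call `fetch(url)` once per batch, pausing `min_interval` seconds between calls."""
    results = []
    for n, batch in enumerate(batches(titles)):
        if n:
            sleep(min_interval)  # stay at or below one request per second
        results.append(fetch(build_url(batch)))
    return results


if __name__ == "__main__":
    print(build_url(["Actrius", "April Love (film)"]))
```

With this shape, 10,000 changed pages per day would need 200 calls, i.e. a few minutes at one call per second. Whether that scales depends on how many pages actually change between your updates.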
Nicolas.
On Wednesday, February 18, 2015 3:45 PM, Mandar Rahurkar
<rahur...@gmail.com> wrote:
Thanks Nicolas ! :)
1. Scraping rendered Wikipedia HTML pages seems like it would be noisy in terms
of data quality. Isn't that so?
2. If we delegate to the MediaWiki API, is this option scalable if we had to
parse the wiki dump on a daily basis?
Thanks,
Mandar
On Wed, Feb 18, 2015 at 1:11 PM, Nicolas Torzec <torz...@yahoo-inc.com> wrote:
Hi Mandar :)
DBpedia does not handle nested templates. It may work for some specific
(simple-enough) templates but it is in no way generalized.
That's why consumer-grade projects consuming Wikipedia data either:
1) Scrape Wikipedia HTML pages directly: i.e. template interpretation is done by
MediaWiki, on wikipedia.org or on dedicated Wikipedia mirrors.
2) Set up their own Wikipedia extraction framework, which may interpret
templates directly or delegate to MediaWiki using its API.
Nicolas.
On Wednesday, February 18, 2015 10:56 AM, Mandar Rahurkar
<rahur...@gmail.com> wrote:
Thanks guys for your comments! Release date information for April Love
(film) is available: http://dbpedia.org/page/April_Love_(film)
but not for Actrius: http://dbpedia.org/page/Actrius
And if you examine the Wikipedia pages, they both seem to use a nested
template: http://en.wikipedia.org/w/index.php?title=April_Love_(film)&action=edit
So maybe this is more than one issue?
Thanks,
Mandar
On Wed, Feb 18, 2015 at 9:22 AM, Alexandru Todor <to...@inf.fu-berlin.de> wrote:
Hi Vladimir, Mandar,
The mappings extractor can't handle nested templates:
http://sourceforge.net/p/dbpedia/mailman/message/32867924/
@Dimitris: I know this is on your to-do list, any progress so far?
Cheers,
Alexandru
On Wed, Feb 18, 2015 at 5:49 PM, Vladimir Alexiev
<vladimir.alex...@ontotext.com> wrote:
Hi Mandar!
Run these queries on http://yasgui.org/, selecting http://dbpedia.org/sparql as
endpoint.
First check the raw property dbp:released:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select * {?x a dbo:Film; dbp:released ?rel
filter exists {?x rdfs:label ?lab filter(strstarts(?lab,"Act"))}}
order by ?x limit 100
As you can see many movies have it, but not Actrius.
So Volha is right, the problem is that in that movie it's not a plain date.
> How were you able to extract that information?
It's in https://en.wikipedia.org/w/index.php?title=Actrius&action=edit:
| release = {{Film date|1996|||}}
I tried to make a mapping:
http://mappings.dbpedia.org/index.php/Mapping_en:Film_date
to extract release year and location (there can be several).
But it doesn't extract anything. Maybe templates INSIDE template fields are not
used for extraction?
Issue: https://github.com/dbpedia/mappings-tracker/issues/46
Test cases: http://mappings.dbpedia.org/index.php/Mapping_en_talk:Film_date
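For illustration, here is a small Python sketch of what handling a template inside a template field involves. This is not how the DBpedia extraction framework does it. It assumes that Film date takes positional arguments in repeating groups of four (year, month, day, location) and skips named arguments. The function name parse_film_dates is made up, and the multi-release input in the last line is invented.

```python
import re

# Matches {{Film date|...}} as long as its arguments contain no further braces.
FILM_DATE = re.compile(r"\{\{\s*Film date\s*\|([^{}]*)\}\}", re.IGNORECASE)


def parse_film_dates(field_value):
    """Return a list of (year, month, day, location) tuples; empty slots become None."""
    releases = []
    for match in FILM_DATE.finditer(field_value):
        # Keep positional arguments only; named ones (anything with "=") are skipped.
        args = [a.strip() for a in match.group(1).split("|") if "=" not in a]
        for i in range(0, len(args), 4):
            group = (args[i:i + 4] + [""] * 4)[:4]  # pad a short final group
            if group[0]:  # a release needs at least a year
                releases.append(tuple(g or None for g in group))
    return releases


# The field value from the Actrius infobox quoted above.
print(parse_film_dates("{{Film date|1996|||}}"))
# Invented example with two releases in one template.
print(parse_film_dates("{{Film date|2001|1|2|Country A|2002|3|4|Country B}}"))
```

The second call returns two tuples, which is the "several dates from one template" case. Arguments that themselves contain templates or references with braces would not match this simple pattern.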
If that's the case, we could map it to another date template here:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/config/dataparser/DateTimeParserConfig.scala#L97
But Volha, can it extract SEVERAL dates from one template?
------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion