Re: [Dbpedia-discussion] Problem with extracted data

Dimitris Kontokostas Mon, 22 Apr 2013 06:24:22 -0700

Hi,

I created a new extractor a few days ago that collects all the templates
used in a page.
Maybe this can help with Julien's approach:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ArticleTemplatesExtractor.scala


Cheers,
Dimitris


On Mon, Apr 22, 2013 at 3:37 PM, Julien Plu <
[email protected]> wrote:

> @Jona: If I create a new Scala class here:
> "org.dbpedia.extraction.mappings.fr.PopulationExtractor.scala"
>
> and write this in my extraction.default.properties file:
> "org.dbpedia.extraction.mappings.fr.PopulationExtractor"
>
> I get a "ClassNotFound" exception, even though my class extends "Extractor"
> and has the same name as the file :-(
>
> Best.
>
> Julien.
>
>
> 2013/4/22 Julien Plu <[email protected]>
>
>> Apparently your solution doesn't work, because the template
>> "Données/Toulouse/évolution_population"
>> <http://fr.wikipedia.org/wiki/Mod%C3%A8le:Donn%C3%A9es/Toulouse/%C3%A9volution_population>
>> doesn't appear among the "dbo:wikiPageUsesTemplate" property values :-(
>>
>>
>> http://data.lirmm.fr/sparql/?default-graph-uri=&query=select+*+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FToulouse%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FwikiPageUsesTemplate%3E+%3Ft%7D&should-sponge=&format=text%2Fhtml&timeout=0&debug=on
>>
>> Best.
>>
>> Julien.
>>
>>
>> 2013/4/22 Julien Plu <[email protected]>
>>
>>> Hi Julien,
>>>
>>>
>>> > You will store data extracted from the template pages and then insert
>>> > them when you parse the article page?
>>>
>>> To answer your question: no, that's not what we have in mind. It's more
>>> like "HomepageExtractor", for example: create a gz file containing only
>>> the population data.
>>>
>>> But yes, I think your solution can work too; I need to test it :-)
>>>
>>> Best.
>>>
>>> Julien.
>>>
>>>
>>> 2013/4/22 Julien Cojan <[email protected]>
>>>
>>>> Hi Julien, Jona,
>>>>
>>>> I just saw your discussion about externalised templates.
>>>> For information, the property prop-fr:population appears on
>>>> http://fr.dbpedia.org because the template
>>>> Données/Toulouse/évolution_population
>>>> <http://fr.wikipedia.org/wiki/Mod%C3%A8le:Donn%C3%A9es/Toulouse/%C3%A9volution_population>
>>>> was not yet used when I did the last extraction.
>>>>
>>>>
>>>> About the extractor you want to add, I am not sure I understood how you
>>>> want to do it.
>>>> You will store data extracted from the template pages and then insert
>>>> them when you parse the article page?
>>>> In that case you need to run the extraction framework twice over the
>>>> Wikipedia dump, since the template page may appear after the article
>>>> page in the dump file.
>>>>
>>>> Wouldn't it be more generic to define some insert/delete SPARQL rules
>>>> to handle this once the extraction process is over?
>>>> Something like:
>>>>
>>>> insert {?s ?p ?v} where {?s dbo:wikiPageUsesTemplate ?t . ?t ?p ?v}
>>>>
>>>> then
>>>>
>>>> delete {?t ?p ?v} where {?s dbo:wikiPageUsesTemplate ?t . ?t ?p ?v}
>>>>
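To make the effect of the two rules above concrete, here is a minimal in-memory sketch in Python. It is not SPARQL and not part of the extraction framework: the graph is just a set of (subject, predicate, value) tuples, and the "fr:" and "tpl:" prefixes, the function names and the population value are made up for illustration.

```python
USES_TEMPLATE = "dbo:wikiPageUsesTemplate"


def run_insert(graph):
    """insert {?s ?p ?v} where {?s dbo:wikiPageUsesTemplate ?t . ?t ?p ?v}"""
    uses = [(s, t) for (s, p, t) in graph if p == USES_TEMPLATE]
    copied = {(s, p, v) for (s, t) in uses for (t2, p, v) in graph if t2 == t}
    return graph | copied


def run_delete(graph):
    """delete {?t ?p ?v} where {?s dbo:wikiPageUsesTemplate ?t . ?t ?p ?v}"""
    templates = {t for (s, p, t) in graph if p == USES_TEMPLATE}
    # Drop every triple whose subject is a template used by some page.
    return {(s, p, v) for (s, p, v) in graph if s not in templates}


# Hypothetical sample data: one article, one data template, one made-up value.
TEMPLATE = "tpl:Données/Toulouse/évolution_population"
graph = {
    ("fr:Toulouse", USES_TEMPLATE, TEMPLATE),
    (TEMPLATE, "prop-fr:population", 440000),
}

result = run_delete(run_insert(graph))
for triple in sorted(result, key=str):
    print(triple)
```

The order matters: running the delete first would remove the template's triples before they could be copied to the pages that use it.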
>>>>
>>>> Cheers,
>>>> Julien C.
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> *From: *"Julien Plu" <[email protected]>
>>>> *To: *"Jona Christopher Sahnwaldt" <[email protected]>
>>>> *Cc: *[email protected]
>>>> *Sent: *Monday 22 April 2013 09:54:59
>>>> *Subject: *Re: [Dbpedia-discussion] Problem with extracted data
>>>>
>>>>
>>>> OK, I will try to code this in a new package "fr" this week. I just
>>>> have to see how to write an extractor and learn Scala :-D
>>>>
>>>> Best.
>>>>
>>>> Julien.
>>>>
>>>>
>>>> 2013/4/22 Jona Christopher Sahnwaldt <[email protected]>
>>>>
>>>>> Good idea! It probably wouldn't be hard to write a specific extractor
>>>>> for this. Maybe just a few dozen lines.
>>>>>
>>>>> Only problem is, we may soon have dozens or hundreds of such
>>>>> specialized extractors. But we can deal with that. :-)
>>>>>
>>>>> If you want to write that extractor, we would be happy to include it
>>>>> in the extraction framework. Here are some instructions on how you can
>>>>> send a pull request on GitHub:
>>>>>
>>>>> https://github.com/dbpedia/extraction-framework/wiki/Contributing
>>>>>
>>>>> To keep things manageable and since this extractor is only applicable
>>>>> for the French Wikipedia edition, I would suggest you create a new
>>>>> package org.dbpedia.extraction.mappings.fr in
>>>>> extraction-framework/core/src/main/scala. Like many other extractors,
>>>>> this one doesn't really belong in the 'core' module, but the
>>>>> extraction framework is not yet very well modularized, so there's no
>>>>> better place.
>>>>>
>>>>> A minor addition: I guess we should change the syntax in the
>>>>> extraction config files: currently, all extractor class names that *do
>>>>> not contain a dot* are prefixed by "org.dbpedia.extraction.mappings.".
>>>>> Example: "AbstractExtractor" becomes
>>>>> "org.dbpedia.extraction.mappings.AbstractExtractor". If we change that
>>>>> rule and prefix all extractor class names that *start with a dot* by
>>>>> "org.dbpedia.extraction.mappings", then you could write
>>>>> ".fr.PopulationExtractor" in your extraction config file. With the
>>>>> current rule, you would have to write the whole class name
>>>>> "org.dbpedia.extraction.mappings.fr.PopulationExtractor". (Of course,
>>>>> with the new rule, we would have to add a dot to all extractor class
>>>>> names in all config files, but that's no big deal.)
>>>>>
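The two naming rules described above (the current one and the proposed one) can be written down as a small Python sketch. This only illustrates the rules as stated in the message; it is not the framework's actual config-loading code, and the function names are hypothetical.

```python
DEFAULT_PACKAGE = "org.dbpedia.extraction.mappings"


def resolve_current(name):
    """Current rule: names that do not contain a dot get the default package."""
    return name if "." in name else DEFAULT_PACKAGE + "." + name


def resolve_proposed(name):
    """Proposed rule: names that start with a dot get the default package."""
    return DEFAULT_PACKAGE + name if name.startswith(".") else name


for name in ("AbstractExtractor", "org.dbpedia.extraction.mappings.fr.PopulationExtractor"):
    print(name, "->", resolve_current(name))

for name in (".AbstractExtractor", ".fr.PopulationExtractor"):
    print(name, "->", resolve_proposed(name))
```

Under the proposed rule a bare name such as "AbstractExtractor" would no longer be prefixed, which is why every existing config entry would need a leading dot.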
>>>>> Cheers,
>>>>> JC
>>>>>
>>>>> On 21 April 2013 22:35, Julien Plu <
>>>>> [email protected]> wrote:
>>>>> > I thought of the same implementation as you, Jona, but slightly
>>>>> > different. Here are my steps:
>>>>> >
>>>>> > 1) Parse the XML file and retrieve all the data about these templates.
>>>>> > For example, we see a "title" tag containing:
>>>>> >
>>>>> > Modèle:Données/Toulouse/évolution_population
>>>>> >
>>>>> > 2) Extract the last "an" and "pop" values
>>>>> > 3) Write the triples to a file:
>>>>> > <http://fr.dbpedia.org/resource/Toulouse>
>>>>> > <http://fr.dbpedia.org/property/population> number pop^^xsd:integer .
>>>>> > <http://fr.dbpedia.org/resource/Toulouse>
>>>>> > <http://fr.dbpedia.org/property/AnneePopulation> year^^xsd:date .
>>>>> >
>>>>> > And so on, for all these templates. What do you think?
>>>>> >
>>>>> > I know it's not really generic, but it's a good starting point for
>>>>> > thinking about a generic solution afterwards.
>>>>> >
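The three steps above could be sketched roughly as follows. This is a standalone Python illustration, not an extractor for the framework (which is written in Scala): the helper names are hypothetical, the unnumbered "an" and "pop" values in the sample are made up, and the year is typed as xsd:gYear instead of the xsd:date mentioned in the message, because a bare year is not a valid xsd:date value.

```python
import re

# Hypothetical helpers; the title pattern and the "an"/"pop" keys come from the thread.
TITLE_RE = re.compile(r"^Modèle:Données/(.+)/évolution_population$")
# Matches only the unnumbered cases "|an=..." and "|pop=...", not "|an1=..." or "|pop1=...".
PARAM_RE = re.compile(r"\|\s*(an|pop)\s*=\s*([^|}]*)")
XSD = "http://www.w3.org/2001/XMLSchema#"
PROP = "http://fr.dbpedia.org/property/"


def last_population(wikitext):
    """Step 2: return (year, population) from the unnumbered switch cases, or None."""
    values = {key: value.strip() for key, value in PARAM_RE.findall(wikitext)}
    year, pop = values.get("an", ""), values.get("pop", "")
    if not (year.isdecimal() and pop.isdecimal()):
        return None
    return int(year), int(pop)


def triples_for(title, wikitext):
    """Steps 1 and 3: map a data-template page to N-Triples lines for its article."""
    match = TITLE_RE.match(title)
    result = last_population(wikitext)
    if match is None or result is None:
        return []
    year, pop = result
    subject = "<http://fr.dbpedia.org/resource/%s>" % match.group(1).replace(" ", "_")
    return [
        '%s <%spopulation> "%d"^^<%sinteger> .' % (subject, PROP, pop, XSD),
        '%s <%sAnneePopulation> "%d"^^<%sgYear> .' % (subject, PROP, year, XSD),
    ]


# Sample page text: an1/pop1 are from the thread, the last an/pop values are invented.
SAMPLE = """<includeonly>{{#switch: {{{1|}}}
|an1=1793|pop1=52612
|an=2010|pop=440000}}</includeonly>"""

for line in triples_for("Modèle:Données/Toulouse/évolution_population", SAMPLE):
    print(line)
```

Pages whose title does not follow the pattern, or whose text has no numeric unnumbered "an" and "pop" cases, simply yield no triples.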
>>>>> > Best.
>>>>> >
>>>>> > Julien.
>>>>> >
>>>>> >
>>>>> > 2013/4/21 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >>
>>>>> >> Good question. Short answer: No, DBpedia can't handle these templates,
>>>>> >> and it's hard to change that.
>>>>> >>
>>>>> >> It would be nice to do it in a generic way: design a system that
>>>>> >> allows users of the mappings wiki to add rules for how such templates
>>>>> >> should be handled in a certain language. Write Scala code that executes
>>>>> >> these rules and parses the template definitions (e.g.
>>>>> >> Modèle:Données/Toulouse/évolution_population) to extract the data and
>>>>> >> store it in memory or in a temporary file. Then during the main
>>>>> >> extraction, when you find a template call like {{Dernière population
>>>>> >> commune de France}}, get the data from storage and generate the
>>>>> >> appropriate triples.
>>>>> >>
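A very reduced Python sketch of the store-then-look-up idea described above, for a single hard-coded template rather than user-defined rules. Everything here is an assumption made for illustration: pages are plain (title, text) pairs, the function names are hypothetical, the article title is assumed to match the name inside the template title exactly, and the final population value is invented.

```python
import re

DATA_TITLE = re.compile(r"^Modèle:Données/(.+)/évolution_population$")
POP_CASE = re.compile(r"\|\s*pop\s*=\s*(\d+)")  # unnumbered "pop" case only
CALL = "{{Dernière population commune de France}}"


def collect(pages):
    """First pass: map article name -> latest population from the data templates."""
    store = {}
    for title, text in pages:
        title_match = DATA_TITLE.match(title)
        pop_match = POP_CASE.search(text)
        if title_match and pop_match:
            store[title_match.group(1)] = int(pop_match.group(1))
    return store


def extract(pages, store):
    """Main pass: yield (article, population) wherever the template call occurs."""
    for title, text in pages:
        if CALL in text and title in store:
            yield title, store[title]


# The article comes before its data template here, so one pass would not be enough.
pages = [
    ("Toulouse", "| population = {{Dernière population commune de France}}"),
    ("Modèle:Données/Toulouse/évolution_population",
     "<includeonly>{{#switch: {{{1|}}}\n|an1=1793|pop1=52612\n|an=2010|pop=440000}}</includeonly>"),
]

store = collect(pages)
print(list(extract(pages, store)))
```

Keeping the collected data in a dictionary corresponds to the "in memory" option; writing it to a temporary file between the two passes would work the same way.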
>>>>> >> A major effort. Related to
>>>>> >> http://wiki.dbpedia.org/gsoc2013/ideas/CrowdsourceTestsAndRules , but
>>>>> >> even bigger.
>>>>> >>
>>>>> >> Maybe it would be easier to extend DBpedia such that the framework can
>>>>> >> "execute" template definitions.
>>>>> >>
>>>>> >> Maybe all that is a waste of time because the data will soon move to
>>>>> >> Wikidata. We just don't know how soon: Three months? Three years?
>>>>> >> Never?
>>>>> >>
>>>>> >> JC
>>>>> >>
>>>>> >> On 21 April 2013 22:04, Julien Plu <[email protected]> wrote:
>>>>> >> > Thanks Jona for these clarifications :-)
>>>>> >> >
>>>>> >> > Another thing: I would like to know whether the extraction framework
>>>>> >> > can use the "data templates". I mean that some property values (in the
>>>>> >> > French Wikipedia, for French settlements) are now replaced by
>>>>> >> > templates, for example:
>>>>> >> >
>>>>> >> > population = {{Dernière population commune de France}} <!-- {{Last
>>>>> >> > population french Settlement}} -->
>>>>> >> >
>>>>> >> > The data itself is stored in template pages following this pattern:
>>>>> >> >
>>>>> >> > http://fr.wikipedia.org/wiki/Modèle:Données/Nom de
>>>>> >> > l'article/évolution_population
>>>>> >> >
>>>>> >> > In English:
>>>>> >> >
>>>>> >> > Template:Data/article name/evolution_population
>>>>> >> >
>>>>> >> > For example:
>>>>> >> >
>>>>> >> > http://fr.wikipedia.org/wiki/Modèle:Données/Toulouse/évolution_population
>>>>> >> >
>>>>> >> > It's always the same address pattern. These templates look like this:
>>>>> >> >
>>>>> >> > <includeonly>{{#switch: {{{1|}}}
>>>>> >> > |an1=1793|pop1=52612
>>>>> >> > |anX=year|popX=number
>>>>> >> > |an=last_year|pop=last_known_number}}</includeonly>
>>>>> >> >
>>>>> >> > These templates are in the XML dump.
>>>>> >> >
>>>>> >> > So has support for this been added to the extraction framework? If
>>>>> >> > not, which files do I have to modify to handle this kind of exception?
>>>>> >> >
>>>>> >> > Best.
>>>>> >> >
>>>>> >> > Julien.
>>>>> >> >
>>>>> >> >
>>>>> >> > 2013/4/21 Jona Christopher Sahnwaldt <[email protected]>
>>>>> >> >>
>>>>> >> >> On 21 April 2013 19:38, Julien Plu <[email protected]> wrote:
>>>>> >> >> > Hi,
>>>>> >> >> >
>>>>> >> >> > Any idea what I am doing wrong? (See my previous mail below.)
>>>>> >> >> >
>>>>> >> >> > Best.
>>>>> >> >> >
>>>>> >> >> > Julien.
>>>>> >> >> >
>>>>> >> >> > From: Julien Plu <[email protected]>
>>>>> >> >> > Date: 2013/4/20
>>>>> >> >> > Subject: Problem with extracted data
>>>>> >> >> > To: "[email protected]"
>>>>> >> >> > <[email protected]>
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > Hi,
>>>>> >> >> >
>>>>> >> >> > After importing the extracted data into my Virtuoso server, I saw
>>>>> >> >> > that I had some strange data. For example, all my URIs start with
>>>>> >> >> > "http://dbpedia.org" and not with "http://fr.dbpedia.org", and I
>>>>> >> >> > don't have the "prop-fr" properties either, even though I put "fr"
>>>>> >> >> > everywhere in the extraction properties file.
>>>>> >> >> >
>>>>> >> >> > I also saw that the data from http://fr.dbpedia.org and mine are
>>>>> >> >> > not the same. For example, compare these two SPARQL results:
>>>>> >> >> >
>>>>> >> >> > mine:
>>>>> >> >> >
>>>>> >> >> > http://data.lirmm.fr:8890/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FToulouse%3E+%3Fp+%3Fo%7D&should-sponge=&format=text%2Fhtml&timeout=0&debug=on
>>>>> >> >> >
>>>>> >> >> > fr.dbpedia.org:
>>>>> >> >> >
>>>>> >> >> > http://fr.dbpedia.org/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%3Chttp%3A%2F%2Ffr.dbpedia.org%2Fresource%2FToulouse%3E+%3Fp+%3Fo%7D&format=text%2Fhtml&timeout=0&debug=on
>>>>> >> >> >
>>>>> >> >> > In mine, I don't have the "http://www.w3.org/2002/07/owl#sameAs" or
>>>>> >> >>
>>>>> >> >> Do you mean the triples like http://www.w3.org/2002/07/owl#sameAs
>>>>> >> >> http://de.dbpedia.org/resource/Toulouse ? To get them, you would have
>>>>> >> >> to download Wikipedia dumps for several other languages, run
>>>>> >> >> InterlangueLinkExtractor on them, and then run
>>>>> >> >>
>>>>> >> >> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/scala/org/dbpedia/extraction/scripts/ProcessInterLanguageLinks.scala
>>>>> >> >>
>>>>> >> >> on all the result files.
>>>>> >> >>
>>>>> >> >> Or you could use the links in
>>>>> >> >>
>>>>> >> >> http://downloads.dbpedia.org/3.8/fr/interlanguage_links_same_as_chapters_fr.ttl.bz2
>>>>> >> >>
>>>>> >> >> or a similar file.
>>>>> >> >>
>>>>> >> >> > "http://fr.dbpedia.org/property/population" properties, among many
>>>>> >> >> > others.
>>>>> >> >> >
>>>>> >> >> > My extraction properties file is attached.
>>>>> >> >> >
>>>>> >> >> > What did I do wrong?
>>>>> >> >> >
>>>>> >> >> > Best.
>>>>> >> >> >
>>>>> >> >> > Julien.
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> ------------------------------------------------------------------------------
>>>>> >> >> > Precog is a next-generation analytics platform capable of
>>>>> advanced
>>>>> >> >> > analytics on semi-structured data. The platform includes APIs
>>>>> for
>>>>> >> >> > building
>>>>> >> >> > apps and a phenomenal toolset for data science. Developers can
>>>>> use
>>>>> >> >> > our toolset for easy data analysis & visualization. Get a free
>>>>> >> >> > account!
>>>>> >> >> > http://www2.precog.com/precogplatform/slashdotnewsletter
>>>>> >> >> > _______________________________________________
>>>>> >> >> > Dbpedia-discussion mailing list
>>>>> >> >> > [email protected]
>>>>> >> >> >
>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>> >> >> >
>>>>> >> >
>>>>> >> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
>
>


-- 
Kontokostas Dimitris
