Hi Rahul,
I think you found a bug :)
Remove all extractors.XX lines except extractors.en and try again;
that should work.
See inline for answers to your questions.
On Sun, Apr 28, 2013 at 10:30 AM, Rahul Sharnagat <[email protected]> wrote:
> Hi Jona,
> I attached those files earlier, but because of my habit of hitting just
> reply, they went only to Dimitris. Please also give me some pointers for the
> doubts mentioned below regarding the proposal.
>
> ---------- Forwarded message ----------
> From: Rahul Sharnagat <[email protected]>
> Date: Sat, Apr 27, 2013 at 1:53 PM
> Subject: Re: [Dbpedia-gsoc] GSoC : Crowdsource tests and extraction rules
> To: Dimitris Kontokostas <[email protected]>
>
>
> Hi Christopher, Dimitris,
> Thanks, Dimitris, for the proxy help. The dump download went easily. But
> while running the extraction with extraction.default.properties, there was
> an error about a missing wikipedias.csv. I went through the mail archive and
> found a solution
> here<http://www.mail-archive.com/[email protected]/msg03921.html>.
> I changed languages to en instead of 10000-. A new error has occurred, for
> which I have attached the logs: it is trying to find arwiki when there is
> none in base-dir.
>
> *Regarding the proposal*
>
> I have started writing my proposal and am facing some problems with
> proposing a tentative solution. Since there are three objectives
> for this idea, I am describing the abstract thoughts I have for each problem.
>
> - Extending the DBpedia mapping wiki so that editors can provide
> rules for the data formats that need to be extracted
>
> I looked into the code of the dataparser. Currently, say for month
> information, the config file extraction.config.dataparser defines how months
> in each language need to be parsed. So the solution to the problem would be
> to define a module that stores, accesses, and specifies the rules for this
> info instead of writing Scala code. Is this what is expected from this task?
> As Christopher mentioned in issue #36, building a DSL could be a solution.
>
Yes, this is the idea. DBpedia already has the code to parse MediaWiki
pages, so your task will be to define rules in wiki markup and then parse
these rules with the framework.
This is exactly what we do with the infobox mappings & the DBpedia ontology
(mappings.dbpedia.org): we developed a specific markup for the definitions,
and we parse these pages when we start the extraction process.
For example, see the source of the following pages:
http://mappings.dbpedia.org/index.php/OntologyClass:Temple
http://mappings.dbpedia.org/index.php/OntologyProperty:PlayRole
http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_software
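For a rough idea of what such a mapping page looks like, here is a simplified
fragment in the style of the Mapping_en pages (the property names are
illustrative; check the live pages above for the exact syntax):

```
{{TemplateMapping
| mapToClass = Software
| mappings =
    {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
    {{PropertyMapping | templateProperty = developer | ontologyProperty = developer }}
}}
```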
>
> - Moving data types from the extraction code to the mapping wiki ontology
>
> This seems easy to do. I just didn't find the code that modifies the wiki,
> or how the mapping wiki extracts info from this code. (I am not very clear
> on this and need some pointers in the code.)
>
People write rules on the mappings wiki using simple, predefined wiki markup,
and then, when the Scala framework runs, it parses these rules.
So, we don't modify the wiki with code, we just read (parse) it, and the
wiki doesn't extract anything; the Scala framework extracts from the wiki.
Here is example code that reads the ontology from the mappings wiki.
Here we read & parse the wiki pages:
https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader.scala#L167
and here we try to reconstruct the ontology from the ontology templates we
defined:
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala
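To make the read-and-parse idea concrete, here is a minimal, self-contained
sketch (in Java for brevity): a month-name rule is written in a simple
wiki-like markup and parsed at startup instead of being hard-coded. The
"MonthRule" template name is invented for illustration and is not DBpedia's
actual syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleParser {
    // Matches rules like: {{MonthRule | name = january | number = 1 }}
    private static final Pattern RULE = Pattern.compile(
        "\\{\\{MonthRule\\s*\\|\\s*name\\s*=\\s*(\\w+)\\s*\\|\\s*number\\s*=\\s*(\\d+)\\s*\\}\\}");

    // Reads every rule on a "wiki page" into a month-name -> month-number map.
    public static Map<String, Integer> parse(String markup) {
        Map<String, Integer> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(markup);
        while (m.find()) {
            rules.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return rules;
    }

    public static void main(String[] args) {
        String page = "{{MonthRule | name = january | number = 1 }}\n"
                    + "{{MonthRule | name = february | number = 2 }}";
        System.out.println(parse(page)); // {january=1, february=2}
    }
}
```

The point is only the division of labor: the rules live as editable markup,
and the framework re-reads them on each run, exactly as with the existing
infobox mappings.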
> - Extending the mapping wiki so that tests can be specified on a wiki page
> and the community can contribute
>
> Do I need to elaborate extensively on what kinds of tests I should be
> implementing? It would be very helpful if you could elaborate on testing. I
> understand that we need a module that takes input from mapping wiki users
> for a particular language, plus a way to define these tests and validate
> the extraction result. Can you give me some pointers regarding the
> implementation of test cases?
>
Maybe Jona can elaborate further on this, but I think he means rules like
these:
https://github.com/dbpedia/extraction-framework/tree/master/core/src/test/scala/org/dbpedia/extraction/dataparser
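As a very rough sketch of what a crowd-sourced test case could boil down to:
each case pairs a snippet of wiki text with the value the extractor is
expected to produce. The toy year parser below is invented for illustration
and only stands in for a real DBpedia dataparser (the real tests live in the
directory above).

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParserTests {
    // One crowd-sourced case: input wiki text plus the expected parse result.
    record TestCase(String input, Optional<Integer> expected) {}

    // Toy stand-in for a dataparser: extract the first 4-digit year, if any.
    static Optional<Integer> parseYear(String text) {
        Matcher m = Pattern.compile("\\b(\\d{4})\\b").matcher(text);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        List<TestCase> cases = List.of(
            new TestCase("born in 1984 in Mumbai", Optional.of(1984)),
            new TestCase("no date here", Optional.empty()));
        long passed = cases.stream()
            .filter(c -> parseYear(c.input()).equals(c.expected()))
            .count();
        System.out.println(passed + "/" + cases.size() + " passed"); // 2/2 passed
    }
}
```

On the wiki side, each TestCase would be written in markup by editors; the
framework would read the cases and run them against the real parsers.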
Let us know if you have more questions
Cheers,
Dimitris
> Thanks
>
>
>
> On Thu, Apr 25, 2013 at 12:19 PM, Dimitris Kontokostas
> <[email protected]> wrote:
>
>> Hi Rahul,
>>
>> You should put your main effort into your application, but I think this task
>> will also help you get a better idea of what to expect.
>>
>> Regarding the proxy, we have the following launcher in dump/pom.xml;
>> please uncomment and adapt the proxy settings:
>> <launcher>
>>   <id>download</id>
>>   <mainClass>org.dbpedia.extraction.dump.download.Download</mainClass>
>>   <!--
>>   <jvmArgs>
>>     <jvmArg>-Dhttp.proxyHost=proxy.server.com</jvmArg>
>>     <jvmArg>-Dhttp.proxyPort=80</jvmArg>
>>     <jvmArg>-Dhttp.nonProxyHosts="localhost|127.0.0.1"</jvmArg>
>>   </jvmArgs>
>>   -->
>>   <!-- ../run download config=download.properties -->
>> </launcher>
>>
>>
>> On Thu, Apr 25, 2013 at 5:29 AM, Rahul Sharnagat
>> <[email protected]> wrote:
>>
>>> Hi Jona,
>>> I think I know the problem. I am on my institute network, which
>>> works through a proxy server. To get Maven working I had to set the
>>> proxy settings in settings.xml and provide it to the mvn command, but
>>> currently I am putting it in the $HOME/.m2/ folder. Does the wiki dump
>>> download accept the Maven proxy settings, or the global http_proxy
>>> environment variable? Maybe this is the source of the error. I will try
>>> to get on a no-proxy network and try again.
>>>
>>>
>>>
>>> On Thu, Apr 25, 2013 at 4:09 AM, Jona Christopher Sahnwaldt <
>>> [email protected]> wrote:
>>>
>>>> On 24 April 2013 20:55, Rahul Sharnagat <[email protected]> wrote:
>>>> > Hi Dimitris,
>>>> > For the last few days, I have been trying to understand the dataparser
>>>> > and mapping code. I also went a little higher in the hierarchy to
>>>> > understand the dependencies. Things are getting clearer now, but it
>>>> > will take some more time to understand all the nuances. I also
>>>> > successfully installed the extraction framework.
>>>> > But there is one problem with getting a dump to work on. As per the
>>>> > documentation (here and here), I could not find download.properties.file
>>>> > in the master branch in the dump folder. But I explored the folder and
>>>> > found download.minimal.properties. I tweaked it according to the
>>>> > instructions for my requirements, but I am getting an error (attached
>>>> > are the full debug log and the tweaked minimal.properties). I tried to
>>>> > find a similar error in the archived messages but could not. Can you
>>>> > help me in this regard?
>>>>
>>>> Strange. Could you just try again? It works for me. Maybe it was a
>>>> temporary problem at Wikimedia. Or maybe something is wrong with your
>>>> network? What does http://dumps.wikimedia.org/enwiki/ look like in
>>>> your browser?
>>>>
>>>> I updated extraction-framework to the latest version from GitHub,
>>>> copied your download.minimal.properties file into my dump/ folder,
>>>> changed the value of base-dir and executed
>>>>
>>>> ../clean-install-run download config=download.minimal.properties
>>>>
>>>> Below is an excerpt from the result.
>>>>
>>>> Cheers,
>>>> JC
>>>>
>>>> [INFO] launcher 'download' selected =>
>>>> org.dbpedia.extraction.dump.download.Download
>>>> done: 0 -
>>>> todo: 1 - wiki=en,locale=en
>>>> downloading 'http://dumps.wikimedia.org/enwiki/' to
>>>> '/Users/jcsahnwaldt/tmp/enwiki/index.html'
>>>> read 3.6132812 KB of 3.6132812 KB in 0.014 seconds (258.0915 KB/s)
>>>> downloading 'http://dumps.wikimedia.org/enwiki/20130403/' to
>>>> '/Users/jcsahnwaldt/tmp/enwiki/20130403/index.html'
>>>> read 102.23535 KB of 102.23535 KB in 0.907 seconds (112.71813 KB/s)
>>>> date page 'http://dumps.wikimedia.org/enwiki/20130403/' has all files
>>>> [pages-articles.xml.bz2]
>>>> downloading 'http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2'
>>>> to '/Users/jcsahnwaldt/tmp/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2'
>>>>
>>>>
>>>> > I am also reading the DBpedia mapping wiki to understand how the
>>>> > ontology is created and how infobox-to-ontology mapping is done, and
>>>> > to relate it to the code. Since a little more than a week is left
>>>> > before the final proposal, I want to create a good draft by the 1st.
>>>> > I will try to send a rough draft by tomorrow.
>>>> >
>>>> > Thanks.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Apr 23, 2013 at 11:58 AM, Rahul Sharnagat <
>>>> [email protected]>
>>>> > wrote:
>>>> >>
>>>> >> Thanks, Dimitris.
>>>> >> I will look into this issue and the related code, and get back to you
>>>> >> if I face any problems.
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 22, 2013 at 6:07 PM, Dimitris Kontokostas <
>>>> [email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi Rahul,
>>>> >>>
>>>> >>> A very good warm-up task for this idea is issue #36
>>>> >>> (https://github.com/dbpedia/extraction-framework/issues/36)
>>>> >>> With this task you will get to know the parser internals and see the
>>>> >>> actual need to crowd-source the rules.
>>>> >>>
>>>> >>> Take a first look and we'll be available for further details
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Dimitris
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Apr 22, 2013 at 5:02 AM, Rahul Sharnagat <
>>>> [email protected]>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Sorry, forgot to add mailing list. Just hit the reply button. :)
>>>> >>>>
>>>> >>>>
>>>> >>>> On Mon, Apr 22, 2013 at 2:19 AM, Dimitris Kontokostas
>>>> >>>> <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Please put the mailing list in cc :)
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Dimitris
>>>> >>>>>
>>>> >>>>> ----
>>>> >>>>> Send from my mobile
>>>> >>>>>
>>>> >>>>> On 21 Apr 2013, 7:55 p.m., the user "Rahul Sharnagat"
>>>> >>>>> <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>>> Hi Dimitris,
>>>> >>>>>> Thanks for the reply.
>>>> >>>>>> I am looking for a warm-up task related to this idea. I
>>>> >>>>>> have started reading about Scala and DBpedia. It should not take
>>>> >>>>>> much time to get accustomed to Scala, since I have previously
>>>> >>>>>> worked in Haskell. Please give me some direction for a warm-up
>>>> >>>>>> task.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Sun, Apr 21, 2013 at 9:39 PM, Dimitris Kontokostas
>>>> >>>>>> <[email protected]> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Rahul,
>>>> >>>>>>>
>>>> >>>>>>> The application period has not started yet, so there is still
>>>> >>>>>>> time left :)
>>>> >>>>>>>
>>>> >>>>>>> Did you read the idea page [1]? The description is pretty long,
>>>> >>>>>>> but you can ask about anything you don't completely understand.
>>>> >>>>>>> Everything should be clear by the time you write your
>>>> >>>>>>> application ;)
>>>> >>>>>>>
>>>> >>>>>>> Best,
>>>> >>>>>>> Dimitris
>>>> >>>>>>>
>>>> >>>>>>> [1]
>>>> http://wiki.dbpedia.org/gsoc2013/ideas/CrowdsourceTestsAndRules
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Sun, Apr 21, 2013 at 4:06 PM, Rahul Sharnagat
>>>> >>>>>>> <[email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi Dimitris,
>>>> >>>>>>>>
>>>> >>>>>>>> I am Rahul Sharnagat, a master's student at IIT Bombay. I am
>>>> >>>>>>>> planning to apply for a DBpedia GSoC project.
>>>> >>>>>>>>
>>>> >>>>>>>> I am interested in the project "Crowdsource tests and extraction
>>>> >>>>>>>> rules". I am working on named entity recognition (NER) and
>>>> >>>>>>>> entity mining as my master's project, and I think working with
>>>> >>>>>>>> DBpedia would help me a lot with that. I interned at Yahoo last
>>>> >>>>>>>> summer, working on refining news indexes.
>>>> >>>>>>>>
>>>> >>>>>>>> I know I am late due to my final exams, but it would be great
>>>> >>>>>>>> if you could help me get started. I have been reading the
>>>> >>>>>>>> DBpedia wiki pages and have downloaded the code from GitHub.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> --
>>>> >>>>>>>> Best Regards,
>>>> >>>>>>>> Rahul Sharnagat
>>>> >>>>>>>> CSE MTech, IITB
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> ------------------------------------------------------------------------------
>>>> >>>>>>>> Precog is a next-generation analytics platform capable of
>>>> advanced
>>>> >>>>>>>> analytics on semi-structured data. The platform includes APIs
>>>> for
>>>> >>>>>>>> building
>>>> >>>>>>>> apps and a phenomenal toolset for data science. Developers can
>>>> use
>>>> >>>>>>>> our toolset for easy data analysis & visualization. Get a free
>>>> >>>>>>>> account!
>>>> >>>>>>>> http://www2.precog.com/precogplatform/slashdotnewsletter
>>>> >>>>>>>> _______________________________________________
>>>> >>>>>>>> Dbpedia-gsoc mailing list
>>>> >>>>>>>> [email protected]
>>>> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>> >>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Kontokostas Dimitris
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> --
>>>> >>>>>> Best Regards,
>>>> >>>>>> Rahul Sharnagat
>>>> >>>>>> CSE MTech, IITB
>>>> >>>>>> H14, B505
>>>> >>>>>> +91.9860.451.056
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Best Regards,
>>>> >>>> Rahul Sharnagat
>>>> >>>> CSE MTech, IITB
>>>> >>>> H14, B505
>>>> >>>> +91.9860.451.056
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Kontokostas Dimitris
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Best Regards,
>>>> >> Rahul Sharnagat
>>>> >> CSE MTech, IITB
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best Regards,
>>>> > Rahul Sharnagat
>>>> > CSE MTech, IITB
>>>> > H14, B505
>>>> > +91.9860.451.056
>>>> >
>>>> >
>>>> ------------------------------------------------------------------------------
>>>> > Try New Relic Now & We'll Send You this Cool Shirt
>>>> > New Relic is the only SaaS-based application performance monitoring
>>>> service
>>>> > that delivers powerful full stack analytics. Optimize and monitor your
>>>> > browser, app, & servers with just a few lines of code. Try New Relic
>>>> > and get this awesome Nerd Life shirt!
>>>> http://p.sf.net/sfu/newrelic_d2d_apr
>>>> > _______________________________________________
>>>> > Dbpedia-gsoc mailing list
>>>> > [email protected]
>>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Rahul Sharnagat
>>> CSE MTech, IITB
>>> H14, B505
>>> +91.9860.451.056
>>>
>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
>
> --
> Best Regards,
> Rahul Sharnagat
> CSE MTech, IITB
> H14, B505
> +91.9860.451.056
>
>
>
> --
> Best Regards,
> Rahul Sharnagat
> CSE MTech, IITB
> H14, B505
> +91.9860.451.056
>
>
>
>
--
Kontokostas Dimitris