Hi Jona,
     I attached those files earlier, but out of habit I hit just "reply",
so they went only to Dimitris. Please also give me some pointers on the
doubts about the proposal mentioned below.

---------- Forwarded message ----------
From: Rahul Sharnagat <[email protected]>
Date: Sat, Apr 27, 2013 at 1:53 PM
Subject: Re: [Dbpedia-gsoc] GSoC : Crowdsource tests and extraction rules
To: Dimitris Kontokostas <[email protected]>


Hi Christopher, Dimitris,
     Thanks, Dimitris, for the proxy help; the dump download went smoothly.
But while running the extraction with extraction.default.properties, there
was an error about a missing wikipedias.csv. I went through the mail archive
and found this solution
here<http://www.mail-archive.com/[email protected]/msg03921.html>.
I changed languages to en instead of 10000-. A new error has now occurred,
for which I have attached the logs: it is trying to find arwiki when there
is none in base-dir.
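
For reference, the relevant lines of my properties file now look roughly like
this (the path is a placeholder; if I understand the old setting correctly,
10000- pulled in every wiki above that article count, which would explain the
arwiki lookup):

    # base-dir points to where the downloaded dumps live
    base-dir=/home/rahul/wikipedia-dumps
    # was: languages=10000-
    languages=en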

*Regarding the proposal*

     I have started writing my proposal and am having some trouble proposing
a tentative solution. Since there are three objectives for this idea, I am
outlining the rough thoughts I have for each one below.

   - Extending the DBpedia mapping wiki so that editors can provide the
   rules for the data formats that need to be extracted

I looked into the dataparser code. Currently, taking month information as an
example, the config package extraction.config.dataparser defines how months
in each language are parsed. So the solution would be a module that stores,
accesses, and specifies these rules instead of hard-coding them in Scala. Is
this what is expected from this task? As Christopher mentioned in issue #36,
building a DSL could be a solution.
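
For my own understanding, here is a very rough sketch of what I imagine (the
template name and all Scala identifiers below are invented by me, not
existing framework classes):

    // A rule defined on the mapping wiki, e.g. as a template like
    //   {{DateFormatRule | language = de | months = Januar, Februar, ... | pattern = d. MMMM yyyy}}
    // would be parsed into a plain data object and fed into the same maps
    // that extraction.config.dataparser currently hard-codes in Scala.
    case class DateFormatRule(language: String, months: Seq[String], pattern: String)

    object WikiRuleLoader {
      // turn the template parameters fetched from the wiki page into a rule
      def fromTemplateParams(params: Map[String, String]): DateFormatRule =
        DateFormatRule(
          params("language"),
          params("months").split(",").map(_.trim).toSeq,
          params("pattern"))
    }

Does that roughly match the direction you have in mind, or is the DSL meant
to be richer than simple key/value templates?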

   - Moving data types from extraction code to the mapping wiki ontology

This seems easy to do. I just haven't found the code that modifies the wiki,
or how the mapping wiki gets information from this code. (I'm not very clear
on this and need some pointers to the relevant code.)
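
To make my (possibly wrong) mental model concrete, here is a stand-in sketch;
all names below are invented, not the real framework classes:

    // Today the XSD datatype seems to be fixed in the extractor code; after
    // the change it would be read from the ontology property that the
    // mapping wiki already defines.
    trait OntologyPropertyLike { def range: Option[String] }  // placeholder for the real ontology property class

    def datatypeFor(property: OntologyPropertyLike): String =
      property.range.getOrElse("xsd:string")  // fall back when the wiki specifies no range

Is the intended change roughly along these lines?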

   - Extending the mapping wiki so that tests can be specified on wiki pages
   and the community can contribute them

Do I need to elaborate extensively on what kind of tests I would be
implementing? It would be very helpful if you could elaborate on the testing
part. My understanding is that we need a module that takes input from mapping
wiki users for a particular language, a way to define these tests, and a way
to validate the extraction results against them. Could you give me some
pointers on the implementation of the test cases?
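
Just to make the question concrete, here is a first rough idea of how a
crowd-sourced test could look in code (all names and values are made up by
me):

    // A test case entered by a mapping editor on a wiki page:
    // "for this Wikipedia page, this property should be extracted with this value".
    case class MappingTest(
      wikiPage: String,        // e.g. "Berlin"
      property: String,        // e.g. "dbo:populationTotal"
      expectedValue: String)   // e.g. "3500000"

    // Validate one test against the extraction result for that page,
    // given here as a simple property -> value map.
    def validate(test: MappingTest, extracted: Map[String, String]): Boolean =
      extracted.get(test.property).contains(test.expectedValue)

Is something along these lines what you expect, or should the tests also
cover parsing rules and datatypes rather than just expected values?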

Thanks



On Thu, Apr 25, 2013 at 12:19 PM, Dimitris Kontokostas <[email protected]> wrote:

> Hi Rahul,
>
> You should put your main effort into your application, but I think this
> task will also help you get a better idea of what to expect.
>
> Regarding the proxy, we have the following launcher in dump/pom.xml,
> please uncomment and adapt the proxy settings:
>                         <launcher>
>                             <id>download</id>
>                             <mainClass>org.dbpedia.extraction.dump.download.Download</mainClass>
>                             <!--
>                             <jvmArgs>
>                                 <jvmArg>-Dhttp.proxyHost=proxy.server.com</jvmArg>
>                                 <jvmArg>-Dhttp.proxyPort=80</jvmArg>
>                                 <jvmArg>-Dhttp.nonProxyHosts="localhost|127.0.0.1"</jvmArg>
>                             </jvmArgs>
>                             -->
>                             <!-- ../run download config=download.properties -->
>                         </launcher>
>
>
> On Thu, Apr 25, 2013 at 5:29 AM, Rahul Sharnagat <[email protected]> wrote:
>
>> Hi Jona,
>>      I think I know the problem. I am on my institute network, which
>> works through a proxy server. To get Maven working I had to set the proxy
>> settings in settings.xml and pass it to the mvn command, but currently I
>> am putting it in the $HOME/.m2/ folder. Does the wiki dump download use
>> the Maven proxy settings or the global http_proxy environment variable?
>> Maybe this is the source of the error. I will try to get onto a network
>> without a proxy and try again.
>>
>>
>>
>> On Thu, Apr 25, 2013 at 4:09 AM, Jona Christopher Sahnwaldt <
>> [email protected]> wrote:
>>
>>> On 24 April 2013 20:55, Rahul Sharnagat <[email protected]> wrote:
>>> > Hi Dimitris,
>>> >     For the last few days, I have been trying to understand the
>>> > dataparser and mapping code. I also went a little higher up the
>>> > hierarchy to understand the dependencies. Things are getting clearer
>>> > now, but it will take some more time to understand all the nuances.
>>> > Also, I successfully installed the extraction framework.
>>> >     But there is one problem with getting a dump to work on. As per
>>> > the documentation (here and here), I could not find the
>>> > download.properties file in the master branch in the dump folder. But
>>> > I explored the folder and found download.minimal.properties. I tweaked
>>> > it according to the instructions for my requirements, but I am getting
>>> > an error (attached are the full debug log and the tweaked
>>> > minimal.properties). I tried to find a similar error in the archived
>>> > messages but could not find it. Can you help me in this regard?
>>>
>>> Strange. Could you just try again? It works for me. Maybe it was a
>>> temporary problem at Wikimedia. Or maybe something is wrong with your
>>> network? What does http://dumps.wikimedia.org/enwiki/ look like in
>>> your browser?
>>>
>>> I updated extraction-framework to the latest version from GitHub,
>>> copied your download.minimal.properties file into my dump/ folder,
>>> changed the value of base-dir and executed
>>>
>>> ../clean-install-run download config=download.minimal.properties
>>>
>>> Below is an excerpt from the result.
>>>
>>> Cheers,
>>> JC
>>>
>>> [INFO] launcher 'download' selected => org.dbpedia.extraction.dump.download.Download
>>> done: 0 -
>>> todo: 1 - wiki=en,locale=en
>>> downloading 'http://dumps.wikimedia.org/enwiki/' to '/Users/jcsahnwaldt/tmp/enwiki/index.html'
>>> read 3.6132812 KB of 3.6132812 KB in 0.014 seconds (258.0915 KB/s)
>>> downloading 'http://dumps.wikimedia.org/enwiki/20130403/' to '/Users/jcsahnwaldt/tmp/enwiki/20130403/index.html'
>>> read 102.23535 KB of 102.23535 KB in 0.907 seconds (112.71813 KB/s)
>>> date page 'http://dumps.wikimedia.org/enwiki/20130403/' has all files [pages-articles.xml.bz2]
>>> downloading 'http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2' to '/Users/jcsahnwaldt/tmp/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2'
>>>
>>>
>>> >     I am also reading the DBpedia mapping wiki to understand how the
>>> > ontology is created and how the infobox-to-ontology mapping is done,
>>> > and to relate it to the code. Since a little more than a week is left
>>> > until the final proposal, I want to have a good draft by the 1st. I
>>> > will try to send a rough draft by tomorrow.
>>> >
>>> > Thanks.
>>> >
>>> >
>>> >
>>> > On Tue, Apr 23, 2013 at 11:58 AM, Rahul Sharnagat <
>>> [email protected]>
>>> > wrote:
>>> >>
>>> >> Thanks Dimitris.
>>> >> I will look into this issue and the related code and get back to you
>>> >> if I face any problems.
>>> >>
>>> >>
>>> >> On Mon, Apr 22, 2013 at 6:07 PM, Dimitris Kontokostas <
>>> [email protected]>
>>> >> wrote:
>>> >>>
>>> >>> Hi Rahul,
>>> >>>
>>> >>> A very good warm-up task for this idea is issue #36
>>> >>> (https://github.com/dbpedia/extraction-framework/issues/36)
>>> >>> With this task you will get to know the parser internals and see the
>>> >>> actual need to crowd-source the rules.
>>> >>>
>>> >>> Take a first look and we'll be available for further details
>>> >>>
>>> >>> Cheers,
>>> >>> Dimitris
>>> >>>
>>> >>>
>>> >>> On Mon, Apr 22, 2013 at 5:02 AM, Rahul Sharnagat <
>>> [email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>> Sorry, I forgot to add the mailing list. I just hit the reply button. :)
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Apr 22, 2013 at 2:19 AM, Dimitris Kontokostas
>>> >>>> <[email protected]> wrote:
>>> >>>>>
>>> >>>>> Please put the mailing list in cc :)
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>> Dimitris
>>> >>>>>
>>> >>>>> ----
>>> >>>>> Sent from my mobile
>>> >>>>>
>>> >>>>> On 21 Apr 2013 at 7:55 p.m., the user "Rahul Sharnagat"
>>> >>>>> <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> Hi Dimitris,
>>> >>>>>>         Thanks for the reply.
>>> >>>>>>         I am looking for a warm-up task related to this idea. I
>>> >>>>>> have started reading about Scala and DBpedia. It should not take
>>> >>>>>> much time to get accustomed to Scala, since I have previously
>>> >>>>>> worked in Haskell. Please give me some direction for a warm-up
>>> >>>>>> task.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Sun, Apr 21, 2013 at 9:39 PM, Dimitris Kontokostas
>>> >>>>>> <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Rahul,
>>> >>>>>>>
>>> >>>>>>> The application period has not started yet, so there is still
>>> >>>>>>> time left :)
>>> >>>>>>>
>>> >>>>>>> Did you read the idea page [1]? The description is pretty long,
>>> >>>>>>> but you can ask about anything you don't understand completely.
>>> >>>>>>> Everything should be clear by the time you write your application ;)
>>> >>>>>>>
>>> >>>>>>> Best,
>>> >>>>>>> Dimitris
>>> >>>>>>>
>>> >>>>>>> [1]
>>> http://wiki.dbpedia.org/gsoc2013/ideas/CrowdsourceTestsAndRules
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Sun, Apr 21, 2013 at 4:06 PM, Rahul Sharnagat
>>> >>>>>>> <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi Dimitris,
>>> >>>>>>>>
>>> >>>>>>>>     I am Rahul Sharnagat, a master's student at IIT Bombay. I
>>> >>>>>>>> am planning to apply for a DBpedia GSoC project.
>>> >>>>>>>>
>>> >>>>>>>>     I am interested in the project "Crowdsource tests and
>>> >>>>>>>> extraction rules". I am working on Named Entity Recognition
>>> >>>>>>>> (NER) and entity mining as my master's project, and I think
>>> >>>>>>>> working with DBpedia would help me a lot with that. I interned
>>> >>>>>>>> at Yahoo last summer, working on refining news indexes.
>>> >>>>>>>>
>>> >>>>>>>>     I know I am late due to my final exams, but it would be
>>> >>>>>>>> great if you could help me get started. I have been reading the
>>> >>>>>>>> DBpedia wiki pages and have also downloaded the code from
>>> >>>>>>>> GitHub.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> --
>>> >>>>>>>> Best Regards,
>>> >>>>>>>> Rahul Sharnagat
>>> >>>>>>>> CSE MTech, IITB
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Kontokostas Dimitris
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Best Regards,
>>> >>>>>> Rahul Sharnagat
>>> >>>>>> CSE MTech, IITB
>>> >>>>>> H14, B505
>>> >>>>>> +91.9860.451.056
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Best Regards,
>>> >>>> Rahul Sharnagat
>>> >>>> CSE MTech, IITB
>>> >>>> H14, B505
>>> >>>> +91.9860.451.056
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Kontokostas Dimitris
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards,
>>> >> Rahul Sharnagat
>>> >> CSE MTech, IITB
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> > Rahul Sharnagat
>>> > CSE MTech, IITB
>>> > H14, B505
>>> > +91.9860.451.056
>>> >
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Rahul Sharnagat
>> CSE MTech, IITB
>> H14, B505
>> +91.9860.451.056
>>
>
>
>
> --
> Kontokostas Dimitris
>



-- 
Best Regards,
Rahul Sharnagat
CSE MTech, IITB
H14, B505
+91.9860.451.056




Attachment: extraction.default.properties
Description: Binary data

Attachment: ext_dump
Description: Binary data

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
