Hi Rahul,

In addition to what Dimitris wrote, here are my €0.02.

On 28 April 2013 09:30, Rahul Sharnagat <[email protected]> wrote:
> Hi Jona,
>      I earlier attached those files. But because of my habit hitting just
> reply it went only to Dimitris. Please also give me some pointer for the
> doubts mentioned below regarding proposal.
>
> ---------- Forwarded message ----------
> From: Rahul Sharnagat <[email protected]>
> Date: Sat, Apr 27, 2013 at 1:53 PM
> Subject: Re: [Dbpedia-gsoc] GSoC : Crowdsource tests and extraction rules
> To: Dimitris Kontokostas <[email protected]>
>
>
> Hi Christopher, Dimitris
>      Thanks Dimitris for the proxy help. Dump download went easily. But
> while running the extraction with extraction.default.properties. there was
> error for missing wikipedias.csv. I went through the mail archieve and found
> this solution here. Changed languages to en instead of 10000- . New error
> has occured for which i have attached the logs. it is trying to find arwiki
> when there is none in base-dir.
>
> Regarding proposal
>
>      I have started writing my proposal and am facing some problems relating
> to proposing a tentative solution. Since there are three objective for this
> idea . I am providing what abstract thought i have for each problem
>
> Extending the DBpedia mapping wiki so that the editors could provide the
> rules for the data format that need to extracted
>
> I looked into the code of dataparser . Currently, let say for month
> information, config file extraction.config.dataparser  defines how months in
> each language need to be parsed .So the solution to the problem would be to
> define a module that stores and access and specify the rules for this info
> instead of writing scala code. Is this what is expected from this task ? As
> christopher mentioned in issue #36 bulding a DSL could be solution.

The possibilities are endless, and I find it hard to pinpoint the area
that's most promising or most important. You could look at DBpedia
Wiktionary. They have an XML-based DSL. I think their DSL can only
specify how to handle certain parts of wikitext, mostly templates, but
I'm not sure. That part would be helpful, for example for the list
templates mentioned in #36 and many other templates like {{Birth date
and age}} or {{Convert}}, but we also need rules for unstructured text
like the months that you mentioned.

I guess the hardest part of this project is to come up with a language
that is flexible enough to cover a lot of parsing problems we have and
at the same time simple enough for users and implementors to
understand.

On one end of the complexity spectrum, I guess there's a relatively
simple but inflexible solution that allows users to specify
configurations for certain DBpedia parsers. In other words, users
would simply add lists of month names to the mappings wiki. Similarly
for many other parsers: date formats, number formats, etc. Although
that wouldn't allow users to configure many other parsing steps, such
a system would also be a great help. Maybe that's the way to go,
simply because it's doable.
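To make that simple end concrete, here is a rough sketch (all names and data below are invented for illustration, not the actual framework API): month names become per-language configuration data that could live on a mappings-wiki page, and the parser just looks them up.

```scala
// Sketch only: month names as per-language data instead of hard-coded
// Scala. In a real system this map would be loaded from a page on the
// mappings wiki; here it is inlined for illustration.
object MonthConfig {
  val months: Map[String, Map[String, Int]] = Map(
    "en" -> Map("january" -> 1, "february" -> 2, "march" -> 3),
    "de" -> Map("januar" -> 1, "februar" -> 2, "märz" -> 3)
  )

  // Look up a month name for a language, ignoring case.
  def parseMonth(lang: String, name: String): Option[Int] =
    months.get(lang).flatMap(_.get(name.toLowerCase))
}
```

Editors of each language edition could then maintain their own month list without touching any Scala code.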

On the other end of the spectrum, there is a super-flexible language
that basically allows users to specify arbitrary patterns - a bit like
regexes, but with nicer syntax - and mappings to RDF types. For
example, users could specify that in certain parts of their
Wikipedia edition, a pattern like "[[A]] ([[B]], [[C]])" almost always
means that A is a person, B a city and C a country. (I'm just making
this up.) Or that <number> <month> <year> is a date, where <number>,
<month> and <year> are in turn user-defined patterns. And so on.
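To give a feel for the flexible end, here is a toy sketch (pattern syntax and all names invented, not a real framework class) that compiles a user pattern with [[X]] placeholders into a regex, so that each placeholder captures one wiki link.

```scala
import scala.util.matching.Regex

// Invented sketch: turn "[[A]] ([[B]], [[C]])" into a regex where each
// [[X]] placeholder becomes a capture group for a wiki link and all
// other characters are matched literally. Assumes well-formed patterns.
object LinkPattern {
  private val Link = """\[\[([^\]]+)\]\]"""

  def compile(pattern: String): Regex = {
    val sb = new StringBuilder
    var i = 0
    while (i < pattern.length) {
      if (pattern.startsWith("[[", i)) {
        val end = pattern.indexOf("]]", i)
        sb.append(Link)            // placeholder -> capture group
        i = end + 2
      } else {
        sb.append(Regex.quote(pattern(i).toString)) // literal character
        i += 1
      }
    }
    new Regex(sb.toString)
  }
}
```

Applied to "[[Alan Turing]] ([[London]], [[England]])", the compiled pattern would capture the three link targets, which a mapping could then assign to person, city and country types.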

I hope that gives you an idea of the possibilities.

>
> Moving data types from extraction code to the mapping wiki ontology
>
> Seems easy to do. I just didn't find the code that modifies wiki.

All the types in OntologyDatatypes.scala [1] should be on the mappings
wiki, probably one page per type, just as there is one page per
ontology class. So we would need code that puts them there.

Before the mappings wiki, the mappings were defined in files that we
stored in SVN. When we started the mappings wiki, we wrote some code
(probably in PHP) that parsed these files and copied their content to
the mappings wiki, using the MediaWiki API. Or did we put them
directly into the database? I don't remember. Maybe you can find that
code somewhere at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/

But it doesn't matter much; you can start from scratch. It's not
really a complex task - the data types are already Scala objects, so
you don't need to parse anything. All you have to do is convert them
to wikitext syntax and call http://mappings.dbpedia.org/api.php to add
a page for each of them.
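As a toy sketch of that conversion (the template name, field names and page-title scheme below are assumptions, not the actual mappings-wiki syntax):

```scala
// Invented sketch: render a datatype as the wikitext of its definition
// page. "DatatypeDefinition", "label" and the "Datatype:" title prefix
// are made up for illustration.
object DatatypeExport {
  case class Datatype(name: String, label: String)

  def pageTitle(dt: Datatype): String = "Datatype:" + dt.name

  def toWikitext(dt: Datatype): String =
    s"{{DatatypeDefinition\n| label = ${dt.label}\n}}"
}
```

The generated wikitext would then be posted to http://mappings.dbpedia.org/api.php (action=edit), one request per datatype.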

You may also have to extend the syntax that we currently use on the
mappings wiki to define RDF types. Again, that's the hard part of this
task. The implementation is then comparatively simple.

[1]  
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/OntologyDatatypes.scala

> Or how
> mapping wiki extracts info from this code . (not very clear on this, need
> some pointers in code)

This class reads the ontology definition pages on the mappings wiki:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/io/OntologyReader.scala

This class loads the mappings:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/MappingsLoader.scala

I think for the datatypes, you would have to extend
OntologyReader.scala a bit. Basically, you have to replace this line

ontologyBuilder.datatypes = OntologyDatatypes.load()

by a few methods that read the data type definition pages.
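For illustration only (the types and page-title convention below are invented stand-ins for the real OntologyReader machinery), the replacement could look roughly like this:

```scala
// Invented sketch: build the datatype list from one wiki page per
// datatype instead of the hard-coded OntologyDatatypes.load().
object DatatypePages {
  case class WikiPage(title: String, text: String)

  // Assumes datatype definition pages are titled "Datatype:<name>";
  // other pages are ignored.
  def loadDatatypes(pages: Seq[WikiPage]): Seq[String] =
    pages.collect {
      case WikiPage(title, _) if title.startsWith("Datatype:") =>
        title.stripPrefix("Datatype:")
    }
}
```

The real version would of course also parse the page body (units, conversion factors, labels), not just the title.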

>
> Extend mapping wiki to include tests could be specified on wiki page so that
> the community can contribute
>
> Do i need to elaborate extensively on what kind of test should i be
> implementing. It would be very helpful if you can elaborate on testing. I
> understand that we need a module that take input from mapping wiki users for
> a particular language and a way that defines these tests and validate the
> extraction result. Can give me some pointer regarding implementation of test
> cases?

As with the parser rules above, there are many possibilities.

The general idea is this: users specify test cases on the mappings
wiki. Your code downloads the test cases, executes them and reports
any errors. The code would be integrated into the extraction framework
build process, just like unit tests. That way, we could be more
confident when we make changes to the extraction and parser code that
we don't introduce too many new errors when we try to parse new number
formats etc.

I see at least two granularities for the tests: the fine-grained tests
would be on the level of the existing dataparser tests [1]. Example
from DateTimeParserTest.scala:

"DataParser" should "return date (16. March 1969, 08:20 UTC)" in
{
    parse("en", "xsd:date", "16. March 1969, 08:20 UTC") should equal
(Some("1969-03-16"))
}

But there are hundreds of ways that dates, numbers etc. are written on
Wikipedia, and I don't think we can cover them all. I think we should
also have a set of a few hundred Wikipedia pages (more or less
randomly chosen) that we upload to the mappings wiki, along with the
RDF triples that we expect the framework to extract from them. These
would be the coarse-grained tests.

Unlike normal unit tests, we should probably not expect 100% of the
tests to succeed, just maybe 99% or 95%, I don't know.
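In code, such a threshold check could look roughly like this (a sketch, not part of the framework; triples are represented as plain strings for simplicity):

```scala
// Sketch: a coarse-grained suite passes when the fraction of expected
// triples actually extracted meets a threshold, not necessarily 100%.
object TripleTests {
  def passRate(expected: Set[String], actual: Set[String]): Double =
    if (expected.isEmpty) 1.0
    else (expected intersect actual).size.toDouble / expected.size

  def suiteOk(expected: Set[String], actual: Set[String],
              threshold: Double = 0.95): Boolean =
    passRate(expected, actual) >= threshold
}
```

The threshold itself (95%? 99%?) would be something to tune once we see how noisy real Wikipedia pages are.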

[1] 
https://github.com/dbpedia/extraction-framework/tree/master/core/src/test/scala/org/dbpedia/extraction/dataparser

By the way, these tests haven't been run in a while because we didn't
have time to maintain them:

https://github.com/dbpedia/extraction-framework/blob/master/pom.xml

<skipTests>true</skipTests>

:-(


Hope that helps a bit. Feel free to ask more questions, although I'm
really busy right now and might not reply quickly.


JC


>
> Thanks
>
>
>
> On Thu, Apr 25, 2013 at 12:19 PM, Dimitris Kontokostas <[email protected]>
> wrote:
>>
>> Hi Rahul,
>>
>> You should put your main effort in your application but I think this task
>> will also help you get a better idea on what to expect.
>>
>> Regarding the proxy, we have the following launcher in dump/pom.xml,
>> please uncomment and adappt the proxy settings
>>                         <launcher>
>>                             <id>download</id>
>>
>> <mainClass>org.dbpedia.extraction.dump.download.Download</mainClass>
>>                             <!--
>>                             <jvmArgs>
>>
>> <jvmArg>-Dhttp.proxyHost=proxy.server.com</jvmArg>
>>                                 <jvmArg>-Dhttp.proxyPort=80</jvmArg>
>>
>> <jvmArg>-Dhttp.nonProxyHosts="localhost|127.0.0.1"</jvmArg>
>>                             </jvmArgs>
>>                             -->
>>                             <!-- ../run download
>> config=download.properties -->
>>                         </launcher>
>>
>>
>> On Thu, Apr 25, 2013 at 5:29 AM, Rahul Sharnagat <[email protected]>
>> wrote:
>>>
>>> Hi Jona,
>>>      I think, i know the problem. I am on my institute network which
>>> works through a proxy server. To get the maven working i had to set the
>>> proxy settings in settings.xml and  provided it to mvn command but currently
>>> i am putting in $HOME/.m2/ folder. Is downloading of wiki dump accepts the
>>> maven proxy setting or global environment of http_proxy? May be this can be
>>> the source of error. I will try to get on a no proxy network and try it
>>> again.
>>>
>>>
>>>
>>> On Thu, Apr 25, 2013 at 4:09 AM, Jona Christopher Sahnwaldt
>>> <[email protected]> wrote:
>>>>
>>>> On 24 April 2013 20:55, Rahul Sharnagat <[email protected]> wrote:
>>>> > Hi Dimitris,
>>>> >     Since last few days, i am trying to understand the dataparser and
>>>> > mapping code.I also went little higher in hierarchy to understand the
>>>> > dependencies. Things are getting clear now but will take some more
>>>> > time to
>>>> > understand all nuances. Also I successfully installed the extraction
>>>> > framework.
>>>> >     But there is one problem for getting the dump to work upon. As per
>>>> > documentation (here and here), i could not find
>>>> > download.properties.file in
>>>> > master branch in dump folder. But i explored the folder and found
>>>> > download.minimal.properties. I tweaked it according to instructions
>>>> > for my
>>>> > requirement but i am getting a error (attached is full debug log and
>>>> > tweaked
>>>> > minimal.properties). I tried to find similar error in archived message
>>>> > but
>>>> > could not find it. Can you help me in this regard ?
>>>>
>>>> Strange. Could you just try again? It works for me. Maybe it was a
>>>> temporary problem at Wikimedia. Or maybe something is wrong with your
>>>> network? What does http://dumps.wikimedia.org/enwiki/ look like in
>>>> your browser?
>>>>
>>>> I updated extraction-framework to the latest version from GitHub,
>>>> copied your download.minimal.properties file into my dump/ folder,
>>>> changed the value of base-dir and executed
>>>>
>>>> ../clean-install-run download config=download.minimal.properties
>>>>
>>>> Below is an excerpt from the result.
>>>>
>>>> Cheers,
>>>> JC
>>>>
>>>> [INFO] launcher 'download' selected =>
>>>> org.dbpedia.extraction.dump.download.Download
>>>> done: 0 -
>>>> todo: 1 - wiki=en,locale=en
>>>> downloading 'http://dumps.wikimedia.org/enwiki/' to
>>>> '/Users/jcsahnwaldt/tmp/enwiki/index.html'
>>>> read 3.6132812 KB of 3.6132812 KB in 0.014 seconds (258.0915 KB/s)
>>>> downloading 'http://dumps.wikimedia.org/enwiki/20130403/' to
>>>> '/Users/jcsahnwaldt/tmp/enwiki/20130403/index.html'
>>>> read 102.23535 KB of 102.23535 KB in 0.907 seconds (112.71813 KB/s)
>>>> date page 'http://dumps.wikimedia.org/enwiki/20130403/' has all files
>>>> [pages-articles.xml.bz2]
>>>> downloading
>>>> 'http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2'
>>>> to
>>>> '/Users/jcsahnwaldt/tmp/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2'
>>>>
>>>>
>>>> >     I am also reading Dbpedia mapping wiki to understand how ontology
>>>> > is
>>>> > created and infobox to ontology mapping is done and relate it to code.
>>>> > Since
>>>> > little more  than a week is left for final proposal, I want to create
>>>> > a good
>>>> > draft by 1st. I will try to send a rough draft by tomorrow.
>>>> >
>>>> > Thanks.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Apr 23, 2013 at 11:58 AM, Rahul Sharnagat
>>>> > <[email protected]>
>>>> > wrote:
>>>> >>
>>>> >> Thanks Dimitris.
>>>> >> I will look into this issue and related code and  get back to you if
>>>> >> i
>>>> >> face any problems.
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 22, 2013 at 6:07 PM, Dimitris Kontokostas
>>>> >> <[email protected]>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi Rahul,
>>>> >>>
>>>> >>> A very good warm-up task for this idea is issue #36
>>>> >>> (https://github.com/dbpedia/extraction-framework/issues/36)
>>>> >>> With this task you will get to know the parser internals and see the
>>>> >>> actual need to crowd-source the rules.
>>>> >>>
>>>> >>> Take a first look and we'll be available for further details
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Dimitris
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Apr 22, 2013 at 5:02 AM, Rahul Sharnagat
>>>> >>> <[email protected]>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Sorry, forgot to add mailing list. Just hit the reply button. :)
>>>> >>>>
>>>> >>>>
>>>> >>>> On Mon, Apr 22, 2013 at 2:19 AM, Dimitris Kontokostas
>>>> >>>> <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Please put the mailing list in cc :)
>>>> >>>>>
>>>> >>>>> Cheers,
>>>> >>>>> Dimitris
>>>> >>>>>
>>>> >>>>> ----
>>>> >>>>> Send from my mobile
>>>> >>>>>
>>>> >>>>> On 21 Apr 2013, 7:55 p.m., the user "Rahul Sharnagat"
>>>> >>>>> <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>>> Hi Dimitris,
>>>> >>>>>>         Thanks for the reply.
>>>> >>>>>>         I am looking for some warm up task relating to this idea
>>>> >>>>>> . I
>>>> >>>>>> have started reading about scala and Dbpedia. It should not take
>>>> >>>>>> much time
>>>> >>>>>> to get accustomed to scala since i have previously worked in
>>>> >>>>>> haskell. Please
>>>> >>>>>> give me some direction for a warm up task.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Sun, Apr 21, 2013 at 9:39 PM, Dimitris Kontokostas
>>>> >>>>>> <[email protected]> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Rahul,
>>>> >>>>>>>
>>>> >>>>>>> The application period did not start yet so there is still time
>>>> >>>>>>> left
>>>> >>>>>>> :)
>>>> >>>>>>>
>>>> >>>>>>> Did you read the idea page [1]? The description is pretty big
>>>> >>>>>>> but you
>>>> >>>>>>> can ask anything you don't understand completely.
>>>> >>>>>>> Everything should be clear when you write your application ;)
>>>> >>>>>>>
>>>> >>>>>>> Best,
>>>> >>>>>>> Dimitris
>>>> >>>>>>>
>>>> >>>>>>> [1]
>>>> >>>>>>> http://wiki.dbpedia.org/gsoc2013/ideas/CrowdsourceTestsAndRules
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Sun, Apr 21, 2013 at 4:06 PM, Rahul Sharnagat
>>>> >>>>>>> <[email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi Dimitris,
>>>> >>>>>>>>
>>>> >>>>>>>>     I am Rahul Sharnagat, master student at IIT Bombay. I am
>>>> >>>>>>>> planning to apply for DBpedia GSoC project.
>>>> >>>>>>>>
>>>> >>>>>>>>     I am interested in the project, Crowdsource tests and
>>>> >>>>>>>> extraction
>>>> >>>>>>>> rules. I am working on Named entity Recognition(NER) and
>>>> >>>>>>>> Entiity mining as
>>>> >>>>>>>> my masters project. I think working with Dbpedia would help me
>>>> >>>>>>>> a lot in
>>>> >>>>>>>> that. I have interned at Yahoo last summer working on refining
>>>> >>>>>>>> news indexes.
>>>> >>>>>>>>
>>>> >>>>>>>>     I know I am late due to my final exams, but it will be
>>>> >>>>>>>> great if
>>>> >>>>>>>> you can help me get started. I have been reading dbpedia
>>>> >>>>>>>> wikipages, also
>>>> >>>>>>>> have downloaded code from github.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> --
>>>> >>>>>>>> Best Regards,
>>>> >>>>>>>> Rahul Sharnagat
>>>> >>>>>>>> CSE MTech, IITB
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> --
>>>> >>>>>>> Kontokostas Dimitris
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> --
>>>> >>>>>> Best Regards,
>>>> >>>>>> Rahul Sharnagat
>>>> >>>>>> CSE MTech, IITB
>>>> >>>>>> H14, B505
>>>> >>>>>> +91.9860.451.056
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Best Regards,
>>>> >>>> Rahul Sharnagat
>>>> >>>> CSE MTech, IITB
>>>> >>>> H14, B505
>>>> >>>> +91.9860.451.056
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Kontokostas Dimitris
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Best Regards,
>>>> >> Rahul Sharnagat
>>>> >> CSE MTech, IITB
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best Regards,
>>>> > Rahul Sharnagat
>>>> > CSE MTech, IITB
>>>> > H14, B505
>>>> > +91.9860.451.056
>>>> >
>>>> >
>>>> > ------------------------------------------------------------------------------
>>>> > Try New Relic Now & We'll Send You this Cool Shirt
>>>> > New Relic is the only SaaS-based application performance monitoring
>>>> > service
>>>> > that delivers powerful full stack analytics. Optimize and monitor your
>>>> > browser, app, & servers with just a few lines of code. Try New Relic
>>>> > and get this awesome Nerd Life shirt!
>>>> > http://p.sf.net/sfu/newrelic_d2d_apr
>>>> > _______________________________________________
>>>> > Dbpedia-gsoc mailing list
>>>> > [email protected]
>>>> > https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>>>> >
>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Rahul Sharnagat
>>> CSE MTech, IITB
>>> H14, B505
>>> +91.9860.451.056
>>
>>
>>
>>
>> --
>> Kontokostas Dimitris
>
>
>
>
> --
> Best Regards,
> Rahul Sharnagat
> CSE MTech, IITB
> H14, B505
> +91.9860.451.056
>
>
>
> --
> Best Regards,
> Rahul Sharnagat
> CSE MTech, IITB
> H14, B505
> +91.9860.451.056

_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
