Hey,

so far, we download the Wikipedia dumps straight into HDFS. For the DBpedia
extraction, we would store the dumps locally first, so we can use any
directory structure that makes it easier.
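
To make the configurable-finder idea from the quoted reply below a bit more
concrete, here is a rough sketch. It is Python rather than Scala, purely for
illustration, and it does not reflect the actual Finder.scala API: the names
DumpFinder, WikimediaLayoutFinder, FlatLayoutFinder and make_finder are made
up, the "flat" layout is a hypothetical alternative, and I am assuming the
current layout is <base>/<wiki>/<date>/<wiki>-<date>-<suffix>, as on
dumps.wikimedia.org.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath


class DumpFinder(ABC):
    """Strategy interface: maps (wiki, date, suffix) to a file path."""

    @abstractmethod
    def file(self, wiki: str, date: str, suffix: str) -> PurePosixPath:
        """Return the path of one dump or output file."""


class WikimediaLayoutFinder(DumpFinder):
    """Assumed current layout: <base>/<wiki>/<date>/<wiki>-<date>-<suffix>."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = PurePosixPath(base_dir)

    def file(self, wiki: str, date: str, suffix: str) -> PurePosixPath:
        return self.base_dir / wiki / date / f"{wiki}-{date}-{suffix}"


class FlatLayoutFinder(DumpFinder):
    """Hypothetical alternative: every file sits directly in one directory."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = PurePosixPath(base_dir)

    def file(self, wiki: str, date: str, suffix: str) -> PurePosixPath:
        return self.base_dir / f"{wiki}-{date}-{suffix}"


# Registry that makes the strategy selectable from a config value.
FINDERS = {"wikimedia": WikimediaLayoutFinder, "flat": FlatLayoutFinder}


def make_finder(layout: str, base_dir: str) -> DumpFinder:
    """Build the finder named by a (hypothetical) 'layout' config setting."""
    try:
        return FINDERS[layout](base_dir)
    except KeyError:
        raise ValueError(f"unknown layout: {layout!r}") from None
```

With something like this, the extraction code would only ever call
finder.file(...), and Spotlight could pick the layout that suits its local
dump directory through a single config value.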

Best,
Jo


On Tue, Apr 16, 2013 at 4:19 PM, Jona Christopher Sahnwaldt <[email protected]> wrote:

>
> On Apr 16, 2013 3:45 PM, "Dimitris Kontokostas" <[email protected]> wrote:
> >
> > Hi Jo,
> >
> > This is a good interdisciplinary task ;)
> >
> > About the extraction script, DBpedia now uses a predefined folder
> structure for locating dumps / extracting data and follows the Wikipedia
> dumps structure [1].
> >
> > There are two options here
> > 1) Spotlight adapts the configuration to accommodate that
> > 2) DBpedia makes the dump easier to run with arbitrary mediawiki dumps
> and output folders.
> >
> > Maybe (1) is a lot easier but I'd vote for (2). ;)
> > For (2) what we need is to create 2 new scripts for download / extract
> that will be based on [2] & [3].
> > Once we have a volunteer we can discuss this in detail
>
> If the desired new folder/file name structure is reasonably similar, we
> don't really need to create new scripts, we basically need to turn
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
> into an interface and provide different implementations: one is the current
> finder, the other would be a new one. Finder.scala already is a Strategy
> pattern, now we just have to make it configurable.
>
> >
> > Cheers,
> > Dimitris
> >
> >
> > [1] http://dumps.wikimedia.org/
> > [2]
> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction.scala
> > [3]
> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/download/Download.scala
> >
> >
> > On Tue, Apr 16, 2013 at 1:29 PM, Joachim Daiber <
> [email protected]> wrote:
> >>
> >> Hey all,
> >>
> >> I added this task to the Spotlight ideas, it's smallish, so it's maybe
> more of a warm-up task:
> >>
> >> ----
> >>
> >> For creating Spotlight models, we need instance_types.nt, redirects.nt
> and disambiguations.nt. Since we want these to be from the same Wikipedia
> dump as the one from which we create the model, integrate the DBpedia
> extraction into the index_db.sh script in DBpedia Spotlight, so that the
> files are automatically produced during indexing.
> >>
> >> ----
> >>
> >> Maybe somebody who knows DEF better than I could comment on how
> complicated this would be to do. We have the Wikipedia dump and we need
> redirects, disambiguation pages and instance types for this version of the
> dump.
> >>
> >> Best,
> >> Jo
> >>
> >>
> ------------------------------------------------------------------------------
> >> Precog is a next-generation analytics platform capable of advanced
> >> analytics on semi-structured data. The platform includes APIs for
> building
> >> apps and a phenomenal toolset for data science. Developers can use
> >> our toolset for easy data analysis & visualization. Get a free account!
> >> http://www2.precog.com/precogplatform/slashdotnewsletter
> >> _______________________________________________
> >> Dbpedia-gsoc mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
> >>
> >
> >
> >
> > --
> > Kontokostas Dimitris
> >
>