Or we run DEF extraction on Hadoop. :)

Another task idea?

Cheers,
Pablo



On Tue, Apr 16, 2013 at 4:34 PM, Joachim Daiber <[email protected]> wrote:

> Hey,
>
> so far, we download the Wikipedia dumps straight into HDFS. For the
> DBpedia extraction, we would store the dumps locally first, so we can use
> any directory structure that makes it easier.
>
> Best,
> Jo
>
>
> On Tue, Apr 16, 2013 at 4:19 PM, Jona Christopher Sahnwaldt <
> [email protected]> wrote:
>
>>
>> On Apr 16, 2013 3:45 PM, "Dimitris Kontokostas" <[email protected]>
>> wrote:
>> >
>> > Hi Jo,
>> >
>> > This is a good interdisciplinary task ;)
>> >
>> > About the extraction script, DBpedia now uses a predefined folder
>> structure for locating dumps / extracting data and follows the Wikipedia
>> dumps structure [1].
>> >
>> > There are two options here
>> > 1) Spotlight adapts the configuration to accommodate that
>> > 2) DBpedia makes the dump extraction easier to run with arbitrary MediaWiki
>> dumps and output folders.
>> >
>> > Maybe (1) is a lot easier but I'd vote for (2). ;)
>> > For (2) what we need is to create 2 new scripts for download / extract
>> that will be based on [2] & [3].
>> > Once we have a volunteer we can discuss this in detail
>>
>> If the desired new folder/file name structure is reasonably similar, we
>> don't really need to create new scripts. We basically need to turn
>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
>> into an interface and provide different implementations: one is the current
>> finder, the other would be a new one. Finder.scala already follows the Strategy
>> pattern; now we just have to make it configurable.
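
A minimal sketch of what "turn the finder into an interface" could look like. This is not the actual Finder.scala API: the names DumpFinder, WikimediaLayoutFinder, FlatLayoutFinder and the "layout" config key are all hypothetical and only illustrate the idea of one interface with a pluggable implementation per folder layout.

```scala
import java.io.File

// Hypothetical interface: resolves a dump file for a given dump date and
// file suffix (e.g. "pages-articles.xml.bz2"). Not the real Finder API.
trait DumpFinder {
  def file(date: String, suffix: String): File
}

// Mirrors the Wikimedia dumps layout:
//   <baseDir>/<wiki>/<date>/<wiki>-<date>-<suffix>
class WikimediaLayoutFinder(baseDir: File, wiki: String) extends DumpFinder {
  def file(date: String, suffix: String): File = {
    val dateDir = new File(new File(baseDir, wiki), date)
    new File(dateDir, s"$wiki-$date-$suffix")
  }
}

// Assumed alternative layout: every file sits directly in one directory,
// which might suit a pipeline that keeps its own folder structure.
class FlatLayoutFinder(dir: File, wiki: String) extends DumpFinder {
  def file(date: String, suffix: String): File =
    new File(dir, s"$wiki-$date-$suffix")
}

object DumpFinder {
  // Picks the implementation from a (hypothetical) configuration value.
  def fromConfig(layout: String, baseDir: File, wiki: String): DumpFinder =
    layout match {
      case "wikimedia" => new WikimediaLayoutFinder(baseDir, wiki)
      case "flat"      => new FlatLayoutFinder(baseDir, wiki)
      case other =>
        throw new IllegalArgumentException(s"unknown layout: $other")
    }
}
```

With something along these lines, the download and extraction code would only talk to the interface, and the choice of layout would come from a single configuration property.
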
>>
>> >
>> > Cheers,
>> > Dimitris
>> >
>> >
>> > [1] http://dumps.wikimedia.org/
>> > [2]
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction.scala
>> > [3]
>> https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/scala/org/dbpedia/extraction/dump/download/Download.scala
>> >
>> >
>> > On Tue, Apr 16, 2013 at 1:29 PM, Joachim Daiber <
>> [email protected]> wrote:
>> >>
>> >> Hey all,
>> >>
>> >> I added this task to the Spotlight ideas, it's smallish, so it's maybe
>> more of a warm-up task:
>> >>
>> >> ----
>> >>
>> >> For creating Spotlight models, we need instance_types.nt, redirects.nt
>> and disambiguations.nt. Since we want these to be from the same Wikipedia
>> dump as the one from which we create the model, integrate the DBpedia
>> extraction into the index_db.sh script in DBpedia Spotlight, so that the
>> files are automatically produced during indexing.
>> >>
>> >> ----
>> >>
>> >> Maybe somebody who knows DEF better than I could comment on how
>> complicated this would be to do. We have the Wikipedia dump and we need
>> redirects, disambiguation pages and instance types for this version of the
>> dump.
>> >>
>> >> Best,
>> >> Jo
>> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> Precog is a next-generation analytics platform capable of advanced
>> >> analytics on semi-structured data. The platform includes APIs for
>> building
>> >> apps and a phenomenal toolset for data science. Developers can use
>> >> our toolset for easy data analysis & visualization. Get a free account!
>> >> http://www2.precog.com/precogplatform/slashdotnewsletter
>> >> _______________________________________________
>> >> Dbpedia-gsoc mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
>> >>
>> >
>> >
>> >
>> > --
>> > Kontokostas Dimitris
>> >
>> >
>> >
>>
>
>
>
>
>


-- 

Pablo N. Mendes
http://pablomendes.com