Hi Amit,
Thanks for your interest in DBpedia. Most of my effort has gone into
DBpedia Spotlight, but I can try to help with the DBpedia Extraction
Framework as well. Maybe the core developers can chip in if I misrepresent
anything.
1) [more docs]
I am not aware of any.
2) [typo in config]
Seems OK. The repeated "Extractor" looks like a typo in the default config,
and your corrected class name is the right fix.
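For reference, the corrected entry in config.properties would then look
roughly like this (the extractors key and the first class name here are only
illustrative, not copied from the default file):

  extractors=org.dbpedia.extraction.mappings.LabelExtractor,\
             org.dbpedia.extraction.mappings.InterLanguageLinksExtractor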
3) ... Am I right? Does the framework work on any particular dump of
> Wikipedia? Also, what goes in the commons branch?
Yes, as far as I can tell you're right, but the framework is not tied to any
particular dump; you just need to follow the directory-structure convention.
The commons directory has a similar structure, see:
wikipediaDump/commons/20110729/commonswiki-20110729-pages-articles.xml
I think this file is only used by the image extractor and maybe a couple of
others. Perhaps it should only be mandatory when the corresponding extractors
are included in the config, but it's likely nobody got around to
implementing that check yet.
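To make the convention concrete, an input directory for en plus commons
would look roughly like this (dates are only examples; each language listed
in the config gets the same layout):

  wikipediaDump/
    commons/
      20110729/
        commonswiki-20110729-pages-articles.xml
    en/
      20111111/
        enwiki-20111111-pages-articles.xml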
4) It seems the AbstractExtractor requires an instance of MediaWiki running
> to parse MediaWiki syntax. ... Can someone shed some more light on this?
> What customization is required? Where can I get one?
The abstract extractor is used to render inline templates, as many articles
start with automatically generated content from templates. See:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/945c24bdc54c/abstractExtraction
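Conceptually, the extractor sends each article's wikitext to that MediaWiki
instance and keeps the rendered result as the abstract. As a rough sketch of
the kind of call involved (host and path here are assumptions, and the
DBpedia-customized instance differs in its details), a plain MediaWiki
render request would be:

  curl 'http://localhost/mediawiki/api.php' \
    --data-urlencode 'action=parse' \
    --data-urlencode 'format=xml' \
    --data-urlencode 'title=Berlin' \
    --data-urlencode 'text={{PAGENAME}} is the capital of Germany.'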
Also another question: Is there a reason for the delay between successive
> DBpedia releases? I was wondering: if the code is already there, why does
> it take 6 months between DBpedia releases? Is there a manual editorial
> process involved, or is it due to development/changes in the framework code
> which are collated in every release?
One reason might be that a lot of the value in DBpedia comes from manually
generated "homogenization" in mappings.dbpedia.org. That, plus testing and
running a stable version of the framework, would probably explain the
release cadence.
Best,
Pablo
On Tue, Nov 22, 2011 at 12:03 PM, Amit Kumar <[email protected]> wrote:
>
> Hey everyone,
> I’m trying to set up the DBpedia extraction framework, as I’m interested in
> extracting structured data from already-downloaded Wikipedia dumps. As per
> my understanding, I need to work in the ‘dump’ directory of the codebase. I
> have tried to reverse-engineer it (Scala is new to me), but I need some
> help.
>
>
> 1. First of all, is there more detailed documentation somewhere
> about setting up and running the pipeline? The documentation available on
> dbpedia.org seems insufficient.
> 2. I understand that I need to create a config.properties file first,
> where I set up input/output locations, the list of extractors, and the
> languages. I tried working with the config.properties.default given in the
> code. There seems to be a typo in the extractor list:
> ‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractorExtractor’.
> Using this gives a ‘class not found’ error. I changed it to
> ‘org.dbpedia.extraction.mappings.InterLanguageLinksExtractor’. Is that OK?
> 3. I can’t find documentation on how to set up the input directory.
> Can someone tell me the details? From what I gather, the input directory
> should contain a ‘commons’ directory plus a directory for each language set
> in config.properties. All these directories must have a subdirectory whose
> name is in YYYYMMDD format, in which you save the XML files such
> as enwiki-20111111-pages-articles.xml. Am I right? Does the framework work
> on any particular dump of Wikipedia? Also, what goes in the commons branch?
> 4. I ran the framework by copying a sample dump,
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2,
> into both the en and commons branches, unzipping it and renaming it as
> required. For now I’m working with the en language only. It works with the
> default 19 extractors but starts failing if I include
> AbstractExtractor. It seems the AbstractExtractor requires an
> instance of MediaWiki running to parse MediaWiki syntax. From the file
> itself: “DBpedia-customized MediaWiki instance is required.” Can
> someone shed some more light on this? What customization is required?
> Where can I get one?
>
>
>
> Sorry if the questions are too basic and already answered somewhere. I
> have tried looking but couldn’t find anything myself.
> Also, another question: Is there a reason for the delay between successive
> DBpedia releases? I was wondering: if the code is already there, why does
> it take 6 months between DBpedia releases? Is there a manual editorial
> process involved, or is it due to development/changes in the framework code
> which are collated in every release?
>
>
> Thanks and regards,
>
> Amit
> Tech Lead
> Cloud and Platform Group
> Yahoo!
>
>