Re: [Dbpedia-discussion] Bad Wikipedia abstracts

Jörg Schüppel Tue, 01 Jul 2008 10:47:50 -0700

Hi Omid,

>
> (1) Why does this process involve a MySQL database?
The DBpedia scripts do not read the XML files directly. All data from the
Wikipedia dumps has to be loaded into a MySQL database first.
You can do this with the import.php script in /importwiki.


For the English Wikipedia you can start it with, e.g.:
  php import.php -c -d DOWNLOADPATH -ip 127.0.0.1 en DBHOST DBNAME DBUSER DBPASSWORD
-ip is your machine's IP (as far as I remember, this helps if mwdumper
throws exceptions).
You can see these parameters by running php import.php in /importwiki.

Once this is done, you'll have the Wikipedia database on your machine and
can start your first extraction.
(The import script downloads the dump, unzips it and writes it to the
database, so it will need some disk space ;)
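
If you want to run the import for several languages, the call above could
be wrapped in a small helper. This is only a sketch: buildImportCommand()
and the sample values are made up for illustration and are not part of the
DBpedia framework; the argument order is simply copied from the command
above.

```php
<?php
// Hypothetical helper (not part of the DBpedia framework): builds the
// import command line shown above and shell-quotes every value.
function buildImportCommand(
    string $downloadPath,
    string $machineIp,
    string $lang,
    string $dbHost,
    string $dbName,
    string $dbUser,
    string $dbPassword
): string {
    $values = [$downloadPath, $machineIp, $lang, $dbHost, $dbName, $dbUser, $dbPassword];
    // escapeshellarg() quotes each value so it is passed as one argument
    $q = array_map('escapeshellarg', $values);

    return "php import.php -c -d {$q[0]} -ip {$q[1]} {$q[2]} {$q[3]} {$q[4]} {$q[5]} {$q[6]}";
}

// Sample values are placeholders; run the printed command in /importwiki.
echo buildImportCommand('/data/dumps', '127.0.0.1', 'en', 'localhost', 'dbpedia_en', 'dbuser', 'secret'), PHP_EOL;
```

The helper only prints the command, so you can check it before running it.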

> (2) As my first project I want to improve on the abstract extractor
> (dbpedia/extraction/extractors/ShortAbstractExtractor.php). I do not
> want to generate anything but "articles_abstract_en.nt", so I want to
> disable everything except this particular module. How do I do this? I
> don't want all other components to run and take time.

First, rename databaseconfig.php.dist to databaseconfig.php and put your
database parameters in this file.
To start an extraction I used start.php: just comment out the extractors
you don't need. Extracting all datasets should be done with extract.php.

(I copied the code for the short abstracts to the end of this mail.)

I hope this helps. It has been a while since I used the framework, so I
may have forgotten something. Just let me know if it's not running.

Jörg


start.php:

function __autoload($class_name) {
    // Extractor classes live in extractors/, destination classes in destinations/
    if (preg_match('~^.*Extractor.*$~', $class_name)) {
        require_once('extractors/' . $class_name . '.php');
    } else if (preg_match('~^.*Destination.*$~', $class_name)) {
        require_once('destinations/' . $class_name . '.php');
    } else {
        require_once $class_name . '.php';
    }
}

// Extracts only the "Google" article; for all articles see the original start.php
$pageTitles = array("Google");

// Create an extraction job
$job = new ExtractionJob(
    new DatabaseWikipedia("en"),
    $pageTitles);

// Create an ExtractionGroup for each extractor
// SimpleDumpDestination writes its output to the screen
$groupShortAbstracts = new ExtractionGroup(new SimpleDumpDestination());
$groupShortAbstracts->addExtractor(new ShortAbstractExtractor());
$job->addExtractionGroup($groupShortAbstracts);

// Execute the extraction job
$manager = new ExtractionManager();
$manager->execute($job);
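
A note on the __autoload() function above: it is deprecated as of PHP 7.2
and was removed in PHP 8.0. On a newer PHP, the same lookup could be
registered with spl_autoload_register() instead. The following is only a
sketch under that assumption; dbpediaClassPath() is a name made up for
illustration, and the directory layout is taken from the snippet above.

```php
<?php
// Hypothetical helper: returns the file the __autoload() above would load.
// strpos() replaces the regexes, which only test whether the class name
// contains "Extractor" or "Destination".
function dbpediaClassPath(string $className): string
{
    if (strpos($className, 'Extractor') !== false) {
        return 'extractors/' . $className . '.php';
    }
    if (strpos($className, 'Destination') !== false) {
        return 'destinations/' . $className . '.php';
    }
    // Everything else (ExtractionJob, ExtractionManager, ...) is loaded
    // from the current directory / include path.
    return $className . '.php';
}

// Register the loader; PHP calls it the first time an unknown class is used.
spl_autoload_register(function (string $className): void {
    require_once dbpediaClassPath($className);
});
```

With this in place of __autoload(), the rest of start.php should not need
to change, since classes are still resolved from the same directories.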






