Hi.
I'm sending them to the list.
Marko
On Wednesday, 23 September 2009, Jona Christopher Sahnwaldt wrote:
> > I wanted to send diffs to this list but I mistakenly sent them only
> > to Jona Christopher Sahnwaldt :(. Should I send them here too?
>
> Yes, send them to the list too. Thanks!
>
> Cheers,
> Christopher
>
> On Wed, Sep 23, 2009 at 09:18, Marko Burjek <[email protected]> wrote:
> > Hi
> >
> > On Tuesday, 22 September 2009, Sebastian Hellmann wrote:
> >> Hi, if you want Slovenian URIs you can
> >> copy the config/dbpedia-dist.ini to config/dbpedia.ini
> >> and adjust some options like:
> >> language = sk
> >> dependsOnEnglishLangLink = false
> >> dbpedia_ns = http://sk.dbpedia.org/
> >> generateOWLAxiomAnnotations = false
> >> geobatchextraction = false
> >> geousedb = false
> >> persondataUseDB = false
> >> LiveMappingBased.useTemplateDb = false
> >
> > I have already done that, but thanks anyway. BTW Slovenia is sl, sk is
> > Slovakia ;).
> >
> > BTW, is there any specific reason that persondataUseDB selects the whole
> > article and then searches it for a link to the English one instead of
> > looking in the langlinks table?
> >
> >> The last four turn off the database dependencies.
> >> Could you maybe send us your configuration of the extract_all.php
> >> script?
> >
> > Sorry I can't find that file in my source code folder.
> >
> >> We are currently working on enabling language specific extractions of
> >> DBpedia.
> >> The code seems to be ready, but we don't have time to test it.
> >> It would be nice if you could tell us about the adaptations you made, as
> >> we are eager to include them in the code.
> >
> > I wanted to send diffs to this list but I mistakenly sent them only
> > to Jona Christopher Sahnwaldt :(. Should I send them here too?
> >
> >> Regards,
> >> Sebastian
> >
> > Thanks everyone for all information.
> >
> >> Jona Christopher Sahnwaldt wrote:
> >> > Hi Marko,
> >> >
> >> > it's great that you're working on a Slovenian extraction! In which way
> >> > did you modify the extractors? Maybe we can add your changes to the
> >> > repository.
> >> >
> >> > The definitions given directly in Wikipedia will be used for the live
> >> > extraction (the group in Leipzig is working on that), while the
> >> > definitions in the files are
> >> > used to produce the dumps found on http://wiki.dbpedia.org/Downloads .
> >> >
> >> > mapping.xls and rules.xls were replaced by mapping.csv and rules.csv.
> >> > The first version of the CSV files contained the same data as the
> >> > Excel files, but going forward from there, we only updated the CSV
> >> > files.
> >> >
> >> > They use the same "format" - the columns have the same meanings as in
> >> > the Excel files. They are described in
> >> > dbpedia/ontology/docs/dbpedia_mapping.txt.
> >> >
> >> > When you open the CSV files with OpenOffice, you will be asked for the
> >> > character encoding, field separator and text delimiter used in the
> >> > file. Set the character encoding to UTF-8, the separator to ";"
> >> > (semicolon) and uncheck all other separators, and make sure that the
> >> > text delimiter is empty (default is a quote). Similarly in Excel.
> >> >
> >> > When you adapt the mappings for the Slovenian Wikipedia, make sure
> >> > that you only change the template URLs and template property names,
> >> > but not the class names and ontology properties.
> >> >
> >> > The main reason for replacing the .xls file was that working with a
> >> > binary format
> >> > like .xls is hard. Finding the differences between different versions
> >> > of such files is
> >> > almost impossible, as is writing scripts that parse them. Our scripts
> >> > that copy the mappings and rules to the database
> >> > (dbpedia/ontology/mapping_db.php and dbpedia/ontology/rules_db.php)
> >> > always worked on CSV files, which we had to export from OpenOffice or
> >> > Excel first. Now we can avoid this extra step.
> >> >
> >> > Cheers,
> >> > Christopher
> >> >
> >> > On Tue, Sep 22, 2009 at 00:05, Marko Burjek <[email protected]> wrote:
> >> >> Hello!
> >> >>
> >> >> I want to use DBpedia to parse the Slovenian wiki. I fixed and updated
> >> >> most of the extractors so that they are more translation-friendly, and
> >> >> now I want to use mappingBasedExtractor. The problem is that
> >> >> mapping.xls and rules.xls, which I wanted to use to create the
> >> >> ontology, were deleted in revision 1441 with the log message "No longer
> >> >> needed". I googled and found this topic:
> >> >> <http://www.mail-archive.com/[email protected]/msg00870.html>
> >> >> Should I use mapping.xls and rules.xls from the previous revision and
> >> >> create the ontology with them, or wait for the new way of specifying
> >> >> mappings, and how long would that probably take? I also see that the
> >> >> *.csv files in the ontology folder were updated even after the *.xls
> >> >> files were removed, but mapping.xls from the last useful revision is
> >> >> the same as the one I downloaded in June.
> >> >>
> >> >> Best regards,
> >> >> Marko
> >> >
> >> > ----------------------------------------------------------------------
> >> >--- ----- Come build with us! The BlackBerry® Developer Conference
> >> > in SF, CA is the only developer event you need to attend this year.
> >> > Jumpstart your developing skills, take BlackBerry mobile applications
> >> > to market and stay ahead of the curve. Join us from November 9-12,
> >> > 2009. Register now! http://p.sf.net/sfu/devconf
> >> > _______________________________________________
> >> > Dbpedia-discussion mailing list
> >> > [email protected]
> >> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >
> > --
> > -----BEGIN GEEK CODE BLOCK-----
> > Version: 3.12
> > GCS d? s++:- a-- C++ UL P+ L+++ E--- W++ N+++ o K- w--
> > O-- M-- V- PS PE Y+ PGP+ t+ 5 X+ R* tv b+ DI+ D--
> > G- e h! r-- y--
> > ------END GEEK CODE BLOCK------
> >
Index: GeoExtractor.php
===================================================================
--- GeoExtractor.php (revision 1554)
+++ GeoExtractor.php (working copy)
@@ -61,6 +61,11 @@
* {{coor title d|deg|NS|deg|EW[|parameters]}}
* {{coor title dm|deg|min|NS|deg|min|EW[|parameters]}}
* {{coor title dms|deg|min|sec|NS|deg|min|sec|EW[|parameters]}}
+ *
+ * {{koordinate dm|deg|min|NS|deg|min|EW[|parameters]}}
+ * {{koordinate dms|deg|min|sec|NS|deg|min|sec|EW[|parameters]}}
+ * {{koordinate dm v naslovu|deg|min|NS|deg|min|EW[|parameters]}}
+ * {{koordinate dms v naslovu|deg|min|sec|NS|deg|min|sec|EW[|parameters]}}
*
* {{coord|latitude|longitude[|parameters][|display=display]}}
* {{coord|dd|N/S|dd|E/W[|parameters][|display=display]}}
@@ -73,6 +78,7 @@
*/
static $knownTemplatesTitle = array(
'/\{\{coor (?:title|at) (?:d|dm|dms)\|([^}]+)\}\}/i',
+ '/\{\{koordinate (?:dm|dms)(?:\s+v naslovu)?\|([^}]+)\}\}/i', //Slovenian
'/\{\{coord\|([^}]+display=[^|}]*title[^}]*)\}\}/i',
'/\{\{Geolinks[^|}]*(?<!no-title)\|([^}]*)\}\}/i',
'/\{\{Mapit[^|}]*\|([^}]+)\}\}/i', /* redirect to Geolinks, always create titles */
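A minimal sketch for checking the new Slovenian pattern outside the extractor. Only the regular expression is taken from the patch; the helper name matchKoordinate and the sample wikitext strings are made up for illustration.

```php
<?php
// Pattern copied from the patch; everything else here is illustrative.
$pattern = '/\{\{koordinate (?:dm|dms)(?:\s+v naslovu)?\|([^}]+)\}\}/i';

// Hypothetical helper: returns the parameter string of the first
// matching template, or null when nothing matches.
function matchKoordinate($pattern, $wikitext)
{
    if (preg_match($pattern, $wikitext, $m)) {
        return $m[1];
    }
    return null;
}

// Title variant ("v naslovu"), plain variant, and an unrelated template.
$inTitle = matchKoordinate($pattern, '{{koordinate dms v naslovu|46|3|0|N|14|30|0|E}}');
$inline  = matchKoordinate($pattern, '{{Koordinate dm|46|3|N|14|30|E}}');
$other   = matchKoordinate($pattern, '{{coord|46.05|14.5}}');

var_dump($inTitle, $inline, $other);
```

Because the "v naslovu" part is optional, the pattern also accepts the plain "koordinate dm" and "koordinate dms" templates, so in $knownTemplatesTitle those are treated like title coordinates as well.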
Index: PersondataExtractor.php
===================================================================
--- PersondataExtractor.php (revision 1554)
+++ PersondataExtractor.php (working copy)
@@ -33,7 +33,7 @@
if(Options::getOption('persondataUseDB')){
$mySource = $WikiDB->getSource($Birthplacematch);
}
-
+ //print $mySource;
preg_match("/\[\[en:(.*)\]\]/", $mySource, $LangLinkmatch);
if(isset($LangLinkmatch[1]))
$BirthPlace = $LangLinkmatch[1];
@@ -62,19 +62,20 @@
$result->addTriple(
$this->getPageURI(),
RDFtriple::URI(FOAF_NAME,false),
- RDFtriple::Literal($PersonData['name'],null,"de"));
+ RDFtriple::Literal($PersonData['name'],null,$this->language));
+
}
if (isset($PersonData['givenname']) && $PersonData['givenname']!="") {
$result->addTriple(
$this->getPageURI(),
RDFtriple::URI(FOAF_GIVENNAME,false),
- RDFtriple::Literal($PersonData['givenname'],null,"de"));
+ RDFtriple::Literal($PersonData['givenname'],null,$this->language));
}
if (isset($PersonData['surname']) && $PersonData['surname']!="") {
$result->addTriple(
$this->getPageURI(),
RDFtriple::URI(FOAF_SURNAME,false),
- RDFtriple::Literal($PersonData['surname'],null,"de"));
+ RDFtriple::Literal($PersonData['surname'],null,$this->language));
}
if(isset($BirthPlace) && $BirthPlace != "")
@@ -130,7 +131,7 @@
$result->addTriple(
$this->getPageURI(),
RDFtriple::URI(DC_DESCRIPTION,false),
- RDFtriple::Literal($PersonData['description'],null,"de"));
+ RDFtriple::Literal($PersonData['description'],null,$this->language));
}
@@ -144,20 +145,48 @@
return $result;
}
+
public function extractPersondata($pageSource, $language)
{
if ($language == "en")
{
$PersondataName = "Persondata";
+ $langName="NAME";
+ $langBirthTime="DATE OF BIRTH";
+ $langBirthPlace="PLACE OF BIRTH";
+ $langDeathTime="DATE OF DEATH";
+ $langDeathPlace="PLACE OF DEATH";
+ $langDesc="SHORT DESCRIPTION";
+ $langAltName="ALTERNATIVE NAMES";
}
-
- if ($language == "de")
+ elseif ($language == "de")
{
$PersondataName = "Personendaten";
+ $langName="NAME";
+ $langBirthTime="GEBURTSDATUM";
+ $langBirthPlace="GEBURTSORT";
+ $langDeathTime="STERBEDATUM";
+ $langDeathPlace="STERBEORT";
+ $langDesc="KURZBESCHREIBUNG";
+ $langAltName="ALTERNATIVNAMEN";
+ }
+ elseif ($language == "sl")
+ {
+ $PersondataName = "(?:Osebni_podatki|Osebni podatki|Persondata)";
+ $langName="NAME";
+ $langBirthTime="DATE OF BIRTH";
+ $langBirthPlace="PLACE OF BIRTH";
+ $langDeathTime="DATE OF DEATH";
+ $langDeathPlace="PLACE OF DEATH";
+ $langDesc="SHORT DESCRIPTION";
+ $langAltName="ALTERNATIVE NAMES";
+
}
preg_match("/\{\{($PersondataName(?>[^{}]+)|(?R))*\}\}/", $pageSource, $match);
+ //print_r($match);
+ //print $pageSource;
if (count($match) == 0)
{
@@ -166,58 +195,68 @@
else
{
- preg_match_all("/\|\s*([A-Z]+)=(.*)/", $match[0], $props, PREG_SET_ORDER);
+ preg_match_all("/\|\s*([A-Z][A-Z ]*?)\s*=(.*)/", $match[0], $props, PREG_SET_ORDER);
+ //print_r($props);
$results = array();
foreach ($props as $keyvalue) {
- //echo $keyvalue[1] . ': ' . trim($keyvalue[2]) . "\n";
- if ($keyvalue[1]== "NAME")
+ //echo $keyvalue[1] . ': ' . trim($keyvalue[2]) . "\n\n";
+ if ($keyvalue[1]==$langName)
{
$results['name'] = addslashes(trim($keyvalue[2]));
}
- if ($keyvalue[1]== "ALTERNATIVNAMEN")
+ if ($keyvalue[1]==$langAltName)
{
$results['altname'] = addslashes(trim($keyvalue[2]));
}
- if ($keyvalue[1]== "KURZBESCHREIBUNG")
+ if ($keyvalue[1]==$langDesc)
{
//$PersonDesc = addslashes(preg_replace_callback("/\[\[([^|]*?)(\|.*?)?\]\]/", array(&$this, 'getLabelForLink'),trim($keyvalue[2])));
$results['description'] = preg_replace_callback("/\[\[([^|]*?)(\|.*?)?\]\]/",array(&$this, 'getLabelForLink'), trim($keyvalue[2]));
}
- if ($keyvalue[1]== "GEBURTSDATUM")
+ if ($keyvalue[1]==$langBirthTime)
{
$results['birthdate'] = $this->StringToDate(addslashes(trim($keyvalue[2])));
}
- if ($keyvalue[1]== "GEBURTSORT")
+ if ($keyvalue[1]==$langBirthPlace)
{
$results['birthplace'] = addslashes(trim($keyvalue[2]));
}
- if ($keyvalue[1]== "STERBEDATUM")
+ if ($keyvalue[1]==$langDeathTime)
{
$results['deathdate'] = $this->StringToDate(addslashes(trim($keyvalue[2])));
}
- if ($keyvalue[1]== "STERBEORT")
+ if ($keyvalue[1]==$langDeathPlace)
{
$results['deathplace'] = addslashes(trim($keyvalue[2]));
}
}
+
- if(isset($results['name'])) {
+ if(isset($results['name']) && substr_count($results['name'],' ')) {
preg_match_all("/^([^,]+),([^,]+)$/", $results['name'], $name, PREG_SET_ORDER);
+ //print_r($results);
+ //print "NAME\n";
+ //print_r($name);
+
if(isset($name[0][1]) && isset($name[0][2])) {
$results['surname'] = trim($name[0][1]);
$results['givenname'] = trim($name[0][2]);
+ }elseif(preg_match_all("/^(\S+)\s+(\S+)$/", $results['name'], $name, PREG_SET_ORDER) ) {
+ $results['surname'] = trim($name[0][2]);
+ $results['givenname'] = trim($name[0][1]);
+
} else
return null;
@@ -250,75 +289,81 @@
}
public function StringToDate($string)
-{
+ {
+
+ $langMonths=array(
+ 'sl' => array(
+ 'januar' => "01",
+ 'februar' => "02",
+ 'marec' => "03",
+ 'april' => "04",
+ 'maj' => "05",
+ 'junij' => "06",
+ 'julij' => "07",
+ 'avgust' => "08",
+ 'september' => "09",
+ 'oktober' => "10",
+ 'november' => "11",
+ 'december' => "12",
+ ),
+ 'de' => array(
+ 'januar' => "01",
+ 'februar' => "02",
+ 'märz' => "03",
+ 'april' => "04",
+ 'mai' => "05",
+ 'juni' => "06",
+ 'juli' => "07",
+ 'august' => "08",
+ 'september' => "09",
+ 'oktober' => "10",
+ 'november' => "11",
+ 'dezember' => "12",
+ ),
+ 'en' => array(
+ 'january' => "01",
+ 'february' => "02",
+ 'march' => "03",
+ 'april' => "04",
+ 'may' => "05",
+ 'june' => "06",
+ 'july' => "07",
+ 'august' => "08",
+ 'september' => "09",
+ 'october' => "10",
+ 'november' => "11",
+ 'december' => "12",
+ )
+
+ );
+
- preg_match_all("/\d\d?./", $string, $meinDatumTag, PREG_SET_ORDER);
+
+ preg_match_all("/(\d\d?)\D/", $string, $meinDatumTag, PREG_SET_ORDER);
preg_match_all("/[A-Z]*[a-z]+/", $string, $meinDatumMonat, PREG_SET_ORDER);
preg_match_all("/\d\d\d\d?/", $string, $meinDatumJahr, PREG_SET_ORDER);
$temp_Monat = "00";
- if(isset($meinDatumMonat[0][0])) {
- if ($meinDatumMonat[0][0] == "Januar")
- {
- $temp_Monat = "01";
+ if(isset($meinDatumMonat[0][0]) && !is_numeric($meinDatumMonat[0][0])) {
+ $temp_Monat = $langMonths[$this->language][strtolower($meinDatumMonat[0][0])];
+ if(is_null($temp_Monat)){
+ $temp_Monat = $langMonths['en'][strtolower($meinDatumMonat[0][0])];
+
}
- if ($meinDatumMonat[0][0] == "Februar")
- {
- $temp_Monat = "02";
}
- if ($meinDatumMonat[0][0] == "März")
- {
- $temp_Monat = "03";
- }
- if ($meinDatumMonat[0][0] == "April")
- {
- $temp_Monat = "04";
- }
- if ($meinDatumMonat[0][0] == "Mai")
- {
- $temp_Monat = "05";
- }
- if ($meinDatumMonat[0][0] == "Juni")
- {
- $temp_Monat = "06";
- }
- if ($meinDatumMonat[0][0] == "Juli")
- {
- $temp_Monat = "07";
- }
- if ($meinDatumMonat[0][0] == "August")
- {
- $temp_Monat = "08";
- }
- if ($meinDatumMonat[0][0] == "September")
- {
- $temp_Monat = "09";
- }
- if ($meinDatumMonat[0][0] == "Oktober")
- {
- $temp_Monat = "10";
- }
- if ($meinDatumMonat[0][0] == "November")
- {
- $temp_Monat = "11";
- }
- if ($meinDatumMonat[0][0] == "Dezember")
- {
- $temp_Monat = "12";
- }
- }
- //echo $meinDatumMonat[0][0];
- if ($temp_Monat == "00")
+ if ($temp_Monat == "00" || is_null($temp_Monat))
{
return null;
}
else
{
+ //print_r($meinDatumTag);
if(isset($meinDatumTag[0][0])) {
- $Tag = str_replace(".","",$meinDatumTag[0][0]);
+ $Tag = str_replace(".","",$meinDatumTag[0][1]);
+ $Tag=trim($Tag);
if (strlen($Tag)==1)
{
$Tag = "0" . $Tag;
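The month lookup with English fallback in StringToDate can be tried in isolation roughly like this. It is only a sketch under assumptions: lookupMonth is a hypothetical helper that is not part of the patch, and the tables here are a small subset of the ones above.

```php
<?php
// Illustrative subset of the month tables; the patch holds the full lists.
$langMonths = array(
    'sl' => array('marec' => '03', 'maj' => '05', 'avgust' => '08'),
    'en' => array('march' => '03', 'may' => '05', 'august' => '08'),
);

// Hypothetical helper: resolve a month name for $language, falling back
// to English, and return null when the name is unknown in both tables.
function lookupMonth(array $langMonths, $language, $name)
{
    $key = strtolower($name);
    foreach (array($language, 'en') as $lang) {
        if (isset($langMonths[$lang][$key])) {
            return $langMonths[$lang][$key];
        }
    }
    return null;
}

$a = lookupMonth($langMonths, 'sl', 'Avgust'); // found in the 'sl' table
$b = lookupMonth($langMonths, 'sl', 'August'); // falls back to 'en'
$c = lookupMonth($langMonths, 'sl', 'Foo');    // unknown in both tables

var_dump($a, $b, $c);
```

Checking with isset() before reading the array avoids undefined-index notices for unknown languages or month names, which a direct lookup would raise.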
Index: HomepageExtractor.php
===================================================================
--- HomepageExtractor.php (revision 1554)
+++ HomepageExtractor.php (working copy)
@@ -13,14 +13,16 @@
/**
* @desc template properties names commonly used for the official homepage. must be lower case.
*/
- var $knownHomepagePredicates = array('website', 'homepage', 'webpräsenz', 'web', 'site', 'siteweb', 'site web');
+ var $knownHomepagePredicates = array('website', 'homepage', 'webpräsenz', 'web', 'site', 'siteweb', 'site web', 'spletna_stran', 'stran');
/**
* @desc regex parts matching words commonly used for the official homepage
*/
var $knownPatterns = array('en' => 'official',
'de' => 'offizielle',
- 'fr' => 'officiel');
+ 'fr' => 'officiel',
+ 'sl' => 'Uradna spletna stran',
+ );
/**
@@ -29,6 +31,7 @@
var $externalLinkSections = array("en" => "External links?",
"de" => "Weblinks?",
"fr" => "(?:Lien externe|Liens externes|Liens et documents externes)",
+ "sl" => "Zunanje povezave", //TODO: check whether other section titles are in use
);
@@ -69,7 +72,6 @@
$infoboxes = $this->getInfoboxes($pageSource);
foreach ($infoboxes[1] as $box) {
$boxProperties = $this->getBoxProperties($box);
-
foreach ($this->knownHomepagePredicates as $pred) {
if (isset($boxProperties[$pred])) {
@@ -201,9 +203,14 @@
return $variant;
}
}
+
+ $urlRegex='~\[(https?://\S+)\s?([^]]+)?\]~i';
+ if($this->language == "sl"){ //Slovenian needs special handling: the brackets are optional
+ $urlRegex='~\[?(https?://\S+)\s?([^]]+)?\]?~i'; //:TODO: Test if it is ok for other languages
+ }
// match external link using normal wiki syntax
- if (preg_match('~\[(https?://\S+)\s?([^]]+)?\]~i', $link, $pieces)) {
+ if (preg_match($urlRegex, $link, $pieces)) {
$url = $pieces[1];
if (count($pieces) == 3) {
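For the TODO on the relaxed URL pattern, here is a small comparison that may help with testing. The two patterns are copied from the patch; the helper firstUrl and the example.org links are made up for illustration.

```php
<?php
// Both patterns are copied from the patch; the sample links are made up.
$strict  = '~\[(https?://\S+)\s?([^]]+)?\]~i';
$relaxed = '~\[?(https?://\S+)\s?([^]]+)?\]?~i';

// Hypothetical helper: return the captured URL, or null on no match.
function firstUrl($regex, $link)
{
    return preg_match($regex, $link, $pieces) ? $pieces[1] : null;
}

// A bare URL is only accepted by the relaxed pattern.
$bareStrict   = firstUrl($strict, 'http://example.org/');
$bareRelaxed  = firstUrl($relaxed, 'http://example.org/');

// A bracketed link with a label works with the relaxed pattern too.
$labelRelaxed = firstUrl($relaxed, '[http://example.org/ Uradna spletna stran]');

// A bracketed link without a label: compare what each pattern captures.
$plainStrict  = firstUrl($strict, '[http://example.org/]');
$plainRelaxed = firstUrl($relaxed, '[http://example.org/]');

var_dump($bareStrict, $bareRelaxed, $labelRelaxed, $plainStrict, $plainRelaxed);
```

In the last case the relaxed pattern keeps the closing bracket as part of the URL, because \S+ is greedy and the trailing \]? no longer forces it to give the bracket back. That might be worth handling before enabling the relaxed pattern for other languages.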
Index: core/language_disambigs.php
===================================================================
--- core/language_disambigs.php (revision 1554)
+++ core/language_disambigs.php (working copy)
@@ -99,7 +99,7 @@
'scn' => array('Disambigua'),
'sdc' => array('Matessi innòmmu'),
'sk' => array('Rozlišovacia stránka'),
- 'sl' => array('Razločitev'),
+ 'sl' => array('razločitev','disambig', "Razločitev", "Disambig", "Priimek", "razločitev-kraj"),
'sq' => array('Kthjellim'),
'sr-ec' => array('Вишезначна одредница'),
'sr-el' => array('Višeznačna odrednica'),
Index: core/language_namespaces.php
===================================================================
--- core/language_namespaces.php (revision 1554)
+++ core/language_namespaces.php (working copy)
@@ -2,6 +2,7 @@
$GLOBALS['MEDIAWIKI_NAMESPACES'] = array(
'legal' => array(MW_CATEGORY_NAMESPACE, MW_TEMPLATE_NAMESPACE, MW_FILE_NAMESPACE, MW_FILEALTERNATIVE_NAMESPACE),
'en' => array(MW_CATEGORY_NAMESPACE => 'Category', MW_TEMPLATE_NAMESPACE=>'Template', MW_FILE_NAMESPACE=>'File', MW_FILEALTERNATIVE_NAMESPACE=>'Image'),
+ 'sl' => array(MW_CATEGORY_NAMESPACE => 'Kategorija', MW_TEMPLATE_NAMESPACE=>'Predloga', MW_FILE_NAMESPACE=>'Datoteka', MW_FILEALTERNATIVE_NAMESPACE=>'Slika'),
'de' => array(MW_CATEGORY_NAMESPACE => 'Kategorie', MW_TEMPLATE_NAMESPACE=>'Template', MW_FILE_NAMESPACE=>'Datei', MW_FILEALTERNATIVE_NAMESPACE=>'Bild')
);