Hi,
While running the DBpedia Extraction Framework over some pages, I found that 
the infobox extractor does not parse dates correctly. This happens on only a 
few pages. I sampled some pages to find a pattern. The following date formats 
are typically used in the wiki markup:

- January 8, 1997
- |2005|1|3
- 26 June 2012

The first two are parsed correctly by DBpedia, but the third is not. An 
example page for the third kind is http://en.wikipedia.org/wiki/Manmohan_Singh. 
The output generated for that page is as follows; note that termStart and 
termEnd are emitted as xsd:integer values rather than dates.

<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/termStart> "26"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/termEnd> "31"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/predecessor> <http://dbpedia.org/resource/Pranab_Mukherjee> .


The following snippet from the InfoboxExtractor deals with this.
--------------

private def extractValue(node : PropertyNode) : List[(String, Datatype)] =
    {
        // TODO don't convert to SI units (what happens to {{convert|25|kg}} ?)
        extractUnitValue(node).foreach(result => return List(result))
        extractNumber(node).foreach(result =>  return List(result))
        extractRankNumber(node).foreach(result => return List(result))
        extractDates(node) match
        {
            case dates if !dates.isEmpty => return dates
            case _ =>
        }
        extractLinks(node) match
        {
            case links if !links.isEmpty => return links
            case _ =>
        }
        StringParser.parse(node).map(value => (value, new Datatype("xsd:string"))).toList
    }
----------------

After some runs with the debugger, it seems that '26 June 2012' is being 
captured as a number before it can be tested as a date: extractNumber runs 
before extractDates and matches the leading '26'. If we move the extractDates 
call above extractNumber, it works. I'm attaching a patch with this change. I 
have tested it on the Manmohan Singh page mentioned above. It fixes this 
date-parsing bug, but I don't know whether the change will affect other use 
cases. Can someone take a look and tell me whether the change is correct? Also, 
what is the process for committing fixes to the code?
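
To make the ordering issue concrete, here is a minimal, self-contained sketch 
that reproduces it outside the framework. Everything in it is a hypothetical 
stand-in and not the framework's actual code: the two regexes, the simplified 
extractNumber and extractDate functions (which take a plain String instead of 
a PropertyNode), and the firstMatch helper. It only assumes that the number 
parser accepts a leading integer, which is what the debugger output suggests.

```scala
import scala.util.matching.Regex

object ExtractorOrderSketch {
  // An extractor returns Some((value, datatype)) on success, None otherwise.
  type Extractor = String => Option[(String, String)]

  // Hypothetical patterns; the real parsers are more elaborate.
  private val LeadingInt: Regex = """^\s*(\d+)""".r
  private val DayMonthYear: Regex = """^\s*(\d{1,2}) ([A-Z][a-z]+) (\d{4})\s*$""".r

  private val Months = List("January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December")

  // Stand-in for extractNumber: accepts any text that starts with digits.
  val extractNumber: Extractor = text =>
    LeadingInt.findFirstMatchIn(text).map(m => (m.group(1), "xsd:integer"))

  // Stand-in for extractDates: accepts only the "26 June 2012" style.
  val extractDate: Extractor = {
    case DayMonthYear(day, month, year) if Months.contains(month) =>
      val mm = Months.indexOf(month) + 1
      Some((f"$year-$mm%02d-${day.toInt}%02d", "xsd:date"))
    case _ => None
  }

  // Try the extractors in the given order and keep the first result,
  // mirroring the early returns in extractValue.
  def firstMatch(text: String, extractors: List[Extractor]): Option[(String, String)] =
    extractors.iterator.map(_(text)).collectFirst { case Some(result) => result }

  def main(args: Array[String]): Unit = {
    val input = "26 June 2012"
    // Number first: the leading "26" wins and the date is lost.
    println(firstMatch(input, List(extractNumber, extractDate)))
    // Date first: the whole value is recognised as a date.
    println(firstMatch(input, List(extractDate, extractNumber)))
  }
}
```

In this sketch a plain number such as "42" still falls through to 
extractNumber when the date extractor runs first, because the date pattern 
must match the whole value. Whether the framework's real date parser is 
equally strict is exactly the open question about side effects of the patch.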

Regards
Amit Kumar

Attachment: infobox_fix_dates.diff

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion