Hi,

While running the DBpedia Extraction Framework over some pages, I noticed that the infobox extractor doesn't parse dates correctly. This happens on only a few pages. I sampled some pages to find a pattern. The following date formats are typically used in the wiki markup:
- January 8, 1997
- |2005|1|3
- 26 June 2012

The first two are handled properly by DBpedia, but the third one is not. An example page for the third kind is http://en.wikipedia.org/wiki/Manmohan_Singh. The erroneous lines generated as output are as follows (note that termStart and termEnd are typed as xsd:integer rather than as dates):

<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/termStart> "26"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/termEnd> "31"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/Manmohan_Singh> <http://dbpedia.org/property/predecessor> <http://dbpedia.org/resource/Pranab_Mukherjee> .

The following code snippet from the InfoboxExtractor deals with this:

--------------
private def extractValue(node : PropertyNode) : List[(String, Datatype)] =
{
    // TODO don't convert to SI units (what happens to {{convert|25|kg}} ?)
    extractUnitValue(node).foreach(result => return List(result))
    extractNumber(node).foreach(result => return List(result))
    extractRankNumber(node).foreach(result => return List(result))
    extractDates(node) match
    {
        case dates if !dates.isEmpty => return dates
        case _ =>
    }
    extractLinks(node) match
    {
        case links if !links.isEmpty => return links
        case _ =>
    }
    StringParser.parse(node).map(value => (value, new Datatype("xsd:string"))).toList
}
----------------

After some runs with the debugger, it seems that '26 June 2012' is being captured as a number before it can be tested as a date. If we move extractDates above extractNumber, it works. I'm attaching a patch with these changes. I have tested it on the Manmohan Singh page mentioned above. It fixes this date-parsing bug, but I don't know whether the change will affect other use cases.

Can someone take a look and tell me whether the change is correct? Also, what is the process for committing fixes to the code?

Regards,
Amit Kumar
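P.S. To illustrate why the order of the extractors seems to matter, here is a minimal, self-contained sketch. It does not use the framework's real parsers: `ExtractorOrderSketch`, `extractNumber`, `extractDate`, `numberFirst` and `dateFirst` are hypothetical stand-ins, and the assumption that the number parser accepts a leading integer and ignores the rest is only my reading of what the debugger showed.

```scala
import scala.util.matching.Regex

// Hypothetical stand-ins for the framework's parsers, NOT the real DBpedia code.
object ExtractorOrderSketch {
  // Assumption: the number parser accepts a leading integer and ignores trailing text.
  private val LeadingInt: Regex = """^\s*(\d+)\b.*""".r
  // Matches the "26 June 2012" style only (day, month name, four-digit year).
  private val DayMonthYear: Regex = """^\s*(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*$""".r

  private val Months: Map[String, Int] = List(
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december"
  ).zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

  def extractNumber(text: String): Option[(String, String)] = text match {
    case LeadingInt(n) => Some((n, "xsd:integer"))
    case _             => None
  }

  def extractDate(text: String): Option[(String, String)] = text match {
    case DayMonthYear(d, m, y) =>
      // Unknown month names yield None, so the next extractor can still run.
      Months.get(m.toLowerCase).map(month => (f"$y-$month%02d-${d.toInt}%02d", "xsd:date"))
    case _ => None
  }

  // Current order: the number extractor runs before the date extractor.
  def numberFirst(text: String): Option[(String, String)] =
    extractNumber(text).orElse(extractDate(text))

  // Order in the patch: the date extractor runs first.
  def dateFirst(text: String): Option[(String, String)] =
    extractDate(text).orElse(extractNumber(text))

  def main(args: Array[String]): Unit = {
    println(numberFirst("26 June 2012")) // Some((26,xsd:integer))
    println(dateFirst("26 June 2012"))   // Some((2012-06-26,xsd:date))
    println(dateFirst("42"))             // Some((42,xsd:integer))
  }
}
```

In this sketch plain numbers still fall through to the number extractor when dates are tried first. Whether that also holds for every value the real extractDates accepts is exactly what I am unsure about.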
infobox_fix_dates.diff
_______________________________________________ Dbpedia-discussion mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
