Does anyone know whether DBpedia plans to improve its logic for extracting abstracts?
The current algorithm fails on so many easy cases that it affects its usefulness. Take this example:

    <http://dbpedia.org/resource/Queen_Silvia_of_Sweden> <http://www.w3.org/2000/01/rdf-schema#comment> "} |- | |- | |}"@en .

The actual article is http://en.wikipedia.org/wiki/Queen_Silvia_of_Sweden .

I could live with a very few articles being misparsed and containing junk like that, but it simply fails on very easy examples like "Volvo":

    ":This article is about Volvo Group - AB Volvo; Volvo Cars is the luxury car maker owned by Ford Motor Company, using the Volvo Trademark."

Looking at the actual page ( http://en.wikipedia.org/wiki/Volvo_Cars ) makes it easy to see that what it should have extracted is:

    "Volvo Cars, or Volvo Personvagnar, is a Swedish automobile maker founded in 1927 in the city of Gothenburg in Sweden."

There are over 10000 articles that have been extracted like that:

    $ grep "This article is about" articles_abstract_en.nt | wc -l
    10552

Nearly 2000 more articles contain redirect messages such as:

    <http://dbpedia.org/resource/1995_Formula_One_season> <http://www.w3.org/2000/01/rdf-schema#comment> ":\"F1 1995\" redirects here. For the video games based on the 1995 Formula One season, see F1 95.|}"@en .

Looking at the article, it's easy to see that a better sentence to fetch is "The 1995 Formula One season was the 46th FIA Formula One World Championship season...".

    $ grep "redirects here." articles_abstract_en.nt | wc -l
    1934

I don't think it should be too complicated to avoid this junk by just looking for strings such as "redirects here" or "This article is about".

Does anyone know whether someone is working on improving this for the next dump? Or has someone already written a better parser that I can use (or whose generated dump I can just download)?
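To make the string check suggested above concrete, here is a minimal Python sketch. It is not DBpedia's extraction code: the names `HATNOTE_MARKERS`, `TABLE_JUNK`, `is_junk` and `first_real_paragraph` are made up for illustration, and it assumes the article text has already been split into candidate paragraphs. It skips paragraphs that contain one of the two marker strings or that consist only of wiki table residue, and returns the first paragraph that survives.

```python
import re

# Marker strings taken from the examples in this message. A real extractor
# would probably need a longer list; this one is only illustrative.
HATNOTE_MARKERS = ("This article is about", "redirects here")

# Matches text made up only of whitespace and wiki table punctuation,
# such as "} |- | |- | |}" (it also matches the empty string).
TABLE_JUNK = re.compile(r"^[\s|{}\-!]*$")


def is_junk(paragraph):
    """Return True if the paragraph looks like a hatnote or table residue."""
    text = paragraph.strip()
    if TABLE_JUNK.match(text):
        return True
    return any(marker in text for marker in HATNOTE_MARKERS)


def first_real_paragraph(paragraphs):
    """Return the first paragraph that is not junk, or None if all are."""
    for paragraph in paragraphs:
        if not is_junk(paragraph):
            return paragraph.strip()
    return None


if __name__ == "__main__":
    candidates = [
        ":This article is about Volvo Group - AB Volvo; Volvo Cars is the "
        "luxury car maker owned by Ford Motor Company, using the Volvo "
        "Trademark.",
        "Volvo Cars, or Volvo Personvagnar, is a Swedish automobile maker "
        "founded in 1927 in the city of Gothenburg in Sweden.",
    ]
    print(first_real_paragraph(candidates))
```

A plain substring test like this can also reject a legitimate opening paragraph that happens to contain one of the markers, so a stricter variant might only apply it to paragraphs that start with ":", as both examples above do.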
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion