Does anyone know whether DBpedia plans to improve its logic for
extracting abstracts?

The current algorithm fails on so many easy cases that it affects its
usefulness.

Take this example:
<http://dbpedia.org/resource/Queen_Silvia_of_Sweden>
<http://www.w3.org/2000/01/rdf-schema#comment> "} |- | |- | |}"@en .

The actual article is http://en.wikipedia.org/wiki/Queen_Silvia_of_Sweden .

I could live with a few articles being misparsed and containing
junk like that, but it simply fails on very easy examples like
"Volvo":

":This article is about Volvo Group - AB Volvo; Volvo Cars is the
luxury car maker owned by Ford Motor Company, using the Volvo
Trademark."

Looking at the actual page ( http://en.wikipedia.org/wiki/Volvo_Cars )
makes it easy to see that what it should have extracted is:

"Volvo Cars, or Volvo Personvagnar, is a Swedish automobile maker
founded in 1927 in the city of Gothenburg in Sweden."

More than 10000 articles have been extracted like that:
>>> $ grep "This article is about" articles_abstract_en.nt | wc -l
>>> 10552

Nearly 2000 more articles contain redirect messages such as:
<http://dbpedia.org/resource/1995_Formula_One_season>
<http://www.w3.org/2000/01/rdf-schema#comment> ":\"F1 1995\" redirects
here. For the video games based on the 1995 Formula One season, see F1
95.|}"@en .

Looking at the article, it's easy to see that a better sentence to
fetch is "The 1995 Formula One season was the 46th FIA Formula One
World Championship season...".

>>> $ grep "redirects here." articles_abstract_en.nt|wc -l
>>> 1934


I don't think it should be too complicated to avoid this junk by
just checking for strings such as "redirects here" or "This article is
about".
Does anyone know whether someone is working on improving this for the
next dump?
Or has someone already written a better parser that I can use
(or whose generated dump I can just download)?
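
As a rough illustration of the string check described above, here is a
minimal Python sketch. It assumes the extractor can see the article's
paragraphs in order as plain text. The function names and the pattern
list are hypothetical and only cover the cases quoted in this mail; this
is not how the DBpedia extractor works today.

```python
import re

# Assumed hatnote phrases, taken only from the two examples in this mail.
# A real list would need more entries.
HATNOTE_PATTERNS = [
    re.compile(r"^:?\s*This article is about\b", re.IGNORECASE),
    re.compile(r"\bredirects here\.", re.IGNORECASE),
]


def looks_like_hatnote(paragraph):
    """True if the paragraph matches one of the assumed hatnote phrases."""
    return any(p.search(paragraph) for p in HATNOTE_PATTERNS)


def looks_like_markup_debris(paragraph):
    """True if the paragraph has no letters or digits, e.g. '} |- | |- | |}'."""
    return not any(ch.isalnum() for ch in paragraph)


def pick_abstract(paragraphs):
    """Return the first paragraph that is neither a hatnote nor debris.

    Returns None when every candidate is rejected.
    """
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue
        if looks_like_hatnote(text) or looks_like_markup_debris(text):
            continue
        return text
    return None
```

Given the Volvo hatnote followed by the real first sentence, pick_abstract
skips the hatnote and returns the "Volvo Cars, or Volvo Personvagnar, ..."
sentence. For an article whose only candidate is table debris it returns
None, so the caller can leave the abstract out instead of emitting junk.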

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
