Re: [Dbpedia-discussion] Bad Wikipedia abstracts

Georgi Kobilarov Fri, 02 May 2008 11:13:42 -0700

Hi Omid,

You are right, the quality of the abstracts needs improvement. We do
plan to produce better abstracts :) but nobody is working on this at the
moment, and nobody will be in the very near future.


As you might know, the DBpedia extraction framework is open source, and
we highly welcome contributions!

You can find the abstract extractor at
http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/extraction/extractors/ShortAbstractExtractor.php?revision=468&view=markup

If you have any questions regarding the framework, please feel free to
send me a message.

Cheers,
Georgi

--
Georgi Kobilarov
Freie Universität Berlin
www.georgikobilarov.com


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:dbpedia-[EMAIL PROTECTED] On Behalf Of Omid Rouhani
> Sent: Friday, May 02, 2008 8:03 PM
> To: dbpedia-discussion@lists.sourceforge.net
> Subject: [Dbpedia-discussion] Bad Wikipedia abstracts
> 
> Does anyone know if DBpedia has plans to improve its logic for
> extracting abstracts?
> 
> The current algorithm fails on so many easy cases that it affects its
> usefulness.
> 
> Take this example:
> <http://dbpedia.org/resource/Queen_Silvia_of_Sweden>
> <http://www.w3.org/2000/01/rdf-schema#comment> "} |- | |- | |}"@en .
> 
> The actual article is
> http://en.wikipedia.org/wiki/Queen_Silvia_of_Sweden .
> 
> I could live with it if a very few articles got misparsed and
> contained junk like that, but it simply fails on very easy examples
> like "Volvo":
> 
> ":This article is about Volvo Group - AB Volvo; Volvo Cars is the
> luxury car maker owned by Ford Motor Company, using the Volvo
> Trademark."
> 
> Looking at the actual page ( http://en.wikipedia.org/wiki/Volvo_Cars )
> makes it easy to see that what it should have extracted is:
> 
> "Volvo Cars, or Volvo Personvagnar, is a Swedish automobile maker
> founded in 1927 in the city of Gothenburg in Sweden."
> 
> There are over 10000 articles that were extracted like that:
> >>> $ grep "This article is about" articles_abstract_en.nt | wc -l
> >>> 10552
> 
> Roughly another 2000 articles contain redirect messages such as:
> <http://dbpedia.org/resource/1995_Formula_One_season>
> <http://www.w3.org/2000/01/rdf-schema#comment> ":\"F1 1995\" redirects
> here. For the video games based on the 1995 Formula One season, see F1
> 95.|}"@en .
> 
> Looking at the article it's easy to see that a better sentence to
> fetch is "The 1995 Formula One season was the 46th FIA Formula One
> World Championship season...".
> 
> >>> $ grep "redirects here." articles_abstract_en.nt | wc -l
> >>> 1934
> 
> 
> I don't think it should be too complicated to avoid getting junk by
> just looking for strings such as "redirects here" or "This article is
> about".
> Does anyone know if someone is working on improving this for the
> next dump?
> Or has someone already written a better parser that I can use
> (or whose generated dump I can simply download)?
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> Dbpedia-discussion mailing list
> Dbpedia-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
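
The string check suggested in the quoted message could look roughly like the sketch below. It is written in Python for brevity and is not the PHP extractor linked above. The names is_junk and keep_clean are hypothetical, the marker list contains only the two phrases named in the message, and the leading ":" test and the table-debris pattern are assumptions based on the three examples quoted.

```python
import re

# Assumed hatnote markers: the two phrases named in the quoted message.
HATNOTE_MARKERS = ("This article is about", "redirects here")

# Assumption: leftover wiki table markup consists only of these characters.
TABLE_DEBRIS = re.compile(r"^[\s|{}\-!]*$")

# Assumption: one English literal per line, as in articles_abstract_en.nt.
TRIPLE = re.compile(r'^<[^>]+> <[^>]+> "(.*)"@en \.$')


def is_junk(abstract):
    """Return True if the abstract looks like a hatnote or table debris."""
    text = abstract.strip()
    if TABLE_DEBRIS.match(text):
        return True
    # In the quoted examples, hatnotes keep the leading ':' wiki indent.
    return text.startswith(":") and any(m in text for m in HATNOTE_MARKERS)


def keep_clean(lines):
    """Yield only those N-Triples lines whose literal does not look like junk."""
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and not is_junk(match.group(1)):
            yield line
```

Note that this only drops bad abstracts. Picking the first real sentence of the article instead would still have to happen in the extractor itself.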

