Thanks, Aleksander.
I think it would be really great if the 'extraction' framework had an
additional step that removed invalid triples. There will always be some
errors of this kind in Wikipedia, and the only way I can see to solve this
in DBpedia is to check each triple and publish only the valid ones.
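
As a rough sketch of what such a check could look like (an illustration in
Python only, not part of the DBpedia extraction framework; the function names
are made up, and the IRI test is a simplification of the N-Triples grammar
that only requires a scheme and rejects the characters N-Triples forbids
inside "<" and ">"):

```python
import re

# Characters that the N-Triples grammar does not allow inside <...>.
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# \uXXXX and \UXXXXXXXX escapes are legal, so drop them before the check above.
_UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')
# An absolute IRI has to start with a scheme such as "http:".
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')
# Subject (IRI or blank node), predicate IRI, then the object up to the final dot.
_TRIPLE = re.compile(r'^(?:<([^>]*)>|(_:\S+))\s+<([^>]*)>\s+(.*?)\s*\.$')
# Optional datatype IRI at the end of a literal.
_DATATYPE = re.compile(r'\^\^<([^>]*)>$')


def is_valid_iri(text):
    """Return True if text looks like an absolute IRI usable in N-Triples."""
    if not _SCHEME.match(text):
        return False
    return not _FORBIDDEN.search(_UCHAR.sub('', text))


def is_valid_triple(line):
    """Return True if every IRI in one N-Triples line passes is_valid_iri."""
    match = _TRIPLE.match(line.strip())
    if match is None:
        return False
    subject, blank, predicate, obj = match.groups()
    if blank is None and not is_valid_iri(subject):
        return False
    if not is_valid_iri(predicate):
        return False
    if obj.startswith('<'):
        # Object is an IRI: this is where <?> and <www.example.com/> get caught.
        return obj.endswith('>') and is_valid_iri(obj[1:-1])
    if obj.startswith('"'):
        # Object is a literal: only its datatype IRI (if any) is checked.
        datatype = _DATATYPE.search(obj)
        return datatype is None or is_valid_iri(datatype.group(1))
    return obj.startswith('_:')


def filter_triples(lines):
    """Yield only the lines that are blank, comments, or valid triples."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#') or is_valid_triple(stripped):
            yield line
```

Applied to the dump, something like
sys.stdout.writelines(filter_triples(sys.stdin)) would stream the file line
by line and drop the offending triples, so memory use should not depend on
the file size. It does not try to repair anything (for example by adding a
missing http://), it only filters.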
Cheers
Danica
On 5 December 2011 16:16, Aleksander Pohl <[email protected]> wrote:
> 05.12.2011 16:58, Danica Damljanovic:
> > Hi everybody
> >
> > I wanted to create my own local repository for the English version of
> > DBpedia 3.7 and downloaded all the files
> > from http://downloads.dbpedia.org/3.7/en/
> >
> > (I am using the latest sesame and owlim to do this)
> >
> > It seems that the mappingbased_properties_en.nt file has many problems.
> > The main one is that non-URI strings appear between "<" and ">", which
> > makes the parser throw an exception, because things such as
> > <?>
> > are not valid URIs.
> >
> > Other examples are many URLs that do not start with http:// but still
> > are between "<" and ">" such as in
> > <http://dbpedia.org/resource/Lee_County,_Florida>
> > <http://xmlns.com/foaf/0.1/homepage> <www.lee-county.com/> .
> > (line 164 177)
> >
> > I wonder what the best way to solve this problem is. I was optimistic at
> > the beginning and started doing 'replace all' to correct some of the
> > URIs, but it turns out that the problem cannot really be reduced to
> > a handful of patterns. For example, there are 'links' to webpages
> > which do not even start with 'www' but are clearly URLs (e.g.
> > <ayat-algormezi.blogspot.com>). One
> > common exception is this:
> I had a similar problem with the Polish DBpedia. The problem is in the
> source data, i.e. some of the links are invalid in Wikipedia. I don't
> have a replace-it-all solution, but you should file a bug against the
> DBpedia extractor asking it to check whether the extracted URIs are valid
> and not to pass them to the output if they are not.
>
> As a short-cut solution to your problem, I would pass the
> invalid input data through a script in Python or Ruby and keep only those
> entries that have valid URIs.
>
> Cheers,
> Aleksander
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure
> contains a definitive record of customers, application performance,
> security threats, fraudulent activity, and more. Splunk takes this
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
--
Best,
Danica Damljanovic, PhD
Research Associate
GATE team http://gate.ac.uk
Natural Language Processing Group
University of Sheffield
http://www.dcs.shef.ac.uk/~danica