Marvin Lugair wrote:
> Hello,
>
> I would like to report back on my loading of dbpedia 3.2 into Open-Source 
> Virtuoso 5.0.9.
> The good news is that I was successful and have a local DBPedia to play with 
> now. Thanks to everyone for their input and suggestions on configuration 
> parameters!
>
> Marv
>
> ----------------
>
> Running Ubuntu 8.10 (Intrepid)
> Kernel 2.6.27-7
> 8GB DDR2 RAM
> AMD Athlon 2.5 GHz dual core
>
> It took around 22 hours to import the core (21 files) and make a .db database 
> file out of them. The import resulted in one dbpedia.db file that is about 
> 20-something GB in size.
> It typically takes about an hour to start that database (load the .db file 
> into memory) and start the virtuoso process.
> As a reference:
> Time to load infobox_en.nt = 52 minutes
>
>
> Some of the parameters in my dbpedia.ini
>
> MaxCheckpointRemap              = 1000000
> MaxMemPoolSize                  = 0
> StopCompilerWhenXOverRunTime    = 1
> DefaultIsolation                = 2
> NumberOfBuffers                 = 550000
> MaxDirtyBuffers                 = 320000
>
>
> Files that had errors
> ---------------
> Three files did not load because of malformed URIs (about 500 of them across 
> the three files, 400-something lines were in the externallinks file). I tried 
> to reload these files with the ttlp_mt bit mask that ignores errors but it 
> did not work.
> I deleted the corresponding triples and reloaded. Basically, you lose those 
> triples. Someone needs to fix these in the DBpedia files.
>
>
> The three files with errors are:
>  1> homepage_en.nt
>  2> externallinks_en.nt
>  3> infobox-mappingbased-loose.nt 
> The URIs had spaces, backslashes, or even Korean characters (in one 
> case) in them. These files need cleaning up.
>
>
>
>
> Some questions
> ---------------------------------------
> * Why does short-abstracts take 4 hours to load though it is only 982MB,
> whereas long-abstracts took 2 hours to load though its size is 1.7GB?
> The only difference is that short was loaded a few files after long. Does 
> load performance change as the database file (the one I am creating, 
> dbpedia.db) grows larger?
>
> * What is the best way to check for and delete duplicate triples in the 
> database?
>
> * Related to this last question, it seems the online DBpedia SPARQL endpoint 
> at dbpedia.org/sparql does not return duplicates over the webpage 
> interface. However, it does return duplicates for the SAME query when 
> submitted through Jena. To reproduce this, paste the following query in the 
> webpage:
>
> select ?s
> where {
> ?s
>  <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
> This will return the following results in my web browser:
> http://dbpedia.org/resource/Bill_Cosby
> http://dbpedia.org/resource/Dick_Gregory
> http://dbpedia.org/resource/Eddie_Murphy
> http://dbpedia.org/resource/Flip_Wilson
> http://dbpedia.org/resource/George_Carlin
> http://dbpedia.org/resource/Mort_Sahl
> http://dbpedia.org/resource/Redd_Foxx
> http://dbpedia.org/resource/Richard_Pryor
> http://dbpedia.org/resource/Rodney_Dangerfield
> http://dbpedia.org/resource/Sam_Kinison
> http://dbpedia.org/resource/Steve_Martin
>
>
> No duplicates.
> Now run the *same* query through a Jena program.
> In my Java source, here is how I am connecting to what I assume is the SAME 
> gateway:
>  QueryExecution qexec = 
> QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
>
> And here is what I get (again, this is the exact same query):
>
> ----------------------------------------------------
> | s                                                |
> ====================================================
> | <http://dbpedia.org/resource/Bill_Cosby>         |
> | <http://dbpedia.org/resource/Dick_Gregory>       |
> | <http://dbpedia.org/resource/Eddie_Murphy>       |
> | <http://dbpedia.org/resource/Flip_Wilson>        |
> | <http://dbpedia.org/resource/George_Carlin>      |
> | <http://dbpedia.org/resource/Mort_Sahl>          |
> | <http://dbpedia.org/resource/Redd_Foxx>          |
> | <http://dbpedia.org/resource/Richard_Pryor>      |
> | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> | <http://dbpedia.org/resource/Sam_Kinison>        |
> | <http://dbpedia.org/resource/Steve_Martin>       |
> | <http://dbpedia.org/resource/Bill_Cosby>         |
> | <http://dbpedia.org/resource/Bill_Cosby>         |
> | <http://dbpedia.org/resource/Dick_Gregory>       |
> | <http://dbpedia.org/resource/Eddie_Murphy>       |
> | <http://dbpedia.org/resource/Flip_Wilson>        |
> | <http://dbpedia.org/resource/George_Carlin>      |
> | <http://dbpedia.org/resource/Mort_Sahl>          |
> | <http://dbpedia.org/resource/Redd_Foxx>          |
> | <http://dbpedia.org/resource/Richard_Pryor>      |
> | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> | <http://dbpedia.org/resource/Sam_Kinison>        |
> | <http://dbpedia.org/resource/Steve_Martin>       |
> | <http://dbpedia.org/resource/Eddie_Murphy>       |
> ----------------------------------------------------
>
> Duplicates!
> Can someone please explain this?
>
> As an aside, when I run this from isql on my newly installed local DBpedia, I 
> get no duplicates (I haven't tried Jena with my local copy).
>
>
> <eom>
>
>
>
>       
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>   
Marvin,

You will see why when you run:

select *
where {graph ?g {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}


As you can see, there are two graphs:
1. http://dbpedia.org
2. http://dbpedia.org/resource/<entity> (this one results from cache 
activity associated with client interactions with Virtuoso)

Solutions:
-- Being specific about source Graph by specifying Graph IRI
select ?s
where {graph <http://dbpedia.org> {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}

OR

select ?s
from <http://dbpedia.org>
where {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}

-- Using DISTINCT

select distinct ?s
where {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
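
To make the cause concrete, here is a minimal Java sketch of the effect. It
is a toy in-memory model, not how Virtuoso stores or evaluates quads, and
the class name GraphDuplicates, the Quad holder, and the three sample quads
are made up for illustration. It only shows that a pattern matched across
every graph returns one row per stored copy of a triple, while restricting
to one graph or applying DISTINCT collapses the repeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Toy model (assumption): a flat list of quads stands in for the store.
final class GraphDuplicates {
    static final String INFLUENCED = "http://dbpedia.org/property/influenced";
    static final String CHRIS_ROCK = "http://dbpedia.org/resource/Chris_Rock";

    // A quad is a triple plus the IRI of the graph that holds it.
    static final class Quad {
        final String graph, s, p, o;
        Quad(String graph, String s, String p, String o) {
            this.graph = graph; this.s = s; this.p = p; this.o = o;
        }
    }

    // Hypothetical sample data: the Bill_Cosby triple is stored twice,
    // once in the main graph and once in a per-entity cache graph.
    static List<Quad> sampleQuads() {
        String cosby = "http://dbpedia.org/resource/Bill_Cosby";
        String murphy = "http://dbpedia.org/resource/Eddie_Murphy";
        return Arrays.asList(
            new Quad("http://dbpedia.org", cosby, INFLUENCED, CHRIS_ROCK),
            new Quad("http://dbpedia.org", murphy, INFLUENCED, CHRIS_ROCK),
            new Quad(cosby, cosby, INFLUENCED, CHRIS_ROCK));
    }

    // Subjects matching "?s <influenced> <Chris_Rock>". A null graph means
    // "match in every graph", so each stored copy yields its own row.
    static List<String> subjects(List<Quad> quads, String graph) {
        List<String> rows = new ArrayList<>();
        for (Quad q : quads) {
            boolean graphOk = graph == null || graph.equals(q.graph);
            if (graphOk && INFLUENCED.equals(q.p) && CHRIS_ROCK.equals(q.o)) {
                rows.add(q.s);
            }
        }
        return rows;
    }

    // Counterpart of SELECT DISTINCT: drop repeats, keep first-seen order.
    static List<String> distinct(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    public static void main(String[] args) {
        List<Quad> quads = sampleQuads();
        System.out.println(subjects(quads, null).size());                 // 3
        System.out.println(subjects(quads, "http://dbpedia.org").size()); // 2
        System.out.println(distinct(subjects(quads, null)).size());       // 2
    }
}
```

Restricting to the graph corresponds to the graph <http://dbpedia.org> and
from <http://dbpedia.org> queries above; distinct() corresponds to the
select distinct variant.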

-- 


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software     Web: http://www.openlinksw.com




