2008/11/23 Kingsley Idehen <[EMAIL PROTECTED]>
> Marvin Lugair wrote:
> > Hello,
> >
> > I would like to report back on my loading of DBpedia 3.2 into Open-Source
> > Virtuoso 5.0.9.
> > The good news is that I was successful and now have a local DBpedia to
> > play with. Thanks to everyone for their input and suggestions on
> > configuration parameters!
> >
> > Marv
> >
> > ----------------
> >
> > Running Ubuntu 8.10 (Intrepid)
> > Kernel 2.6.27-7
> > 8 GB DDR2 RAM
> > AMD Athlon 2.5 GHz dual core
> >
> > It took around 22 hours to import the core (21 files) and make a .db
> > database file out of them. The import resulted in one dbpedia.db file
> > that is about 20-something GB in size.
> > It typically takes about an hour to start that database (load the .db
> > file into memory) and start the virtuoso process.
> > As a reference:
> > Time to load infobox_en.nt = 52 minutes
> >
> >
> > Some of the parameters in my dbpedia.ini
> >
> > MaxCheckpointRemap = 1000000
> > MaxMemPoolSize = 0
> > StopCompilerWhenXOverRunTime = 1
> > DefaultIsolation = 2
> > NumberOfBuffers = 550000
> > MaxDirtyBuffers = 320000
> >
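For orientation, the buffer settings above can be turned into a rough memory
figure. This is only a sketch: it assumes each Virtuoso buffer holds one 8 KB
page, which is not stated anywhere in this thread, and the function name is
made up for illustration.

```python
# Rough estimate only. ASSUMPTION: one buffer = one 8 KB page
# (not taken from this thread; check your Virtuoso documentation).
PAGE_BYTES = 8 * 1024


def buffer_pool_gib(number_of_buffers: int, page_bytes: int = PAGE_BYTES) -> float:
    """Approximate size in GiB of a pool of `number_of_buffers` pages."""
    return number_of_buffers * page_bytes / 2**30


if __name__ == "__main__":
    # Values quoted from the dbpedia.ini excerpt above.
    print(f"NumberOfBuffers = 550000 -> ~{buffer_pool_gib(550000):.1f} GiB")
    print(f"MaxDirtyBuffers = 320000 -> ~{buffer_pool_gib(320000):.1f} GiB")
```

Under that assumption, the buffer pool would take up roughly half of the 8 GB
of RAM in the machine described above.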
> >
> > Files that had errors
> > ---------------
> > Three files did not load because of malformed URIs (about 500 of them
> > across the three files; 400-something of those lines were in the
> > externallinks file). I tried to reload these files with the ttlp_mt bit
> > mask that ignores errors, but it did not work.
> > I deleted the offending triples and reloaded. Basically, you lose
> > those triples. Someone needs to fix these in the DBpedia files.
> >
> >
> > The three files with errors are:
> > 1> homepage_en.nt
> > 2> externallinks_en.nt
> > 3> infobox-mappingbased-loose.nt
> > The URIs had spaces, backslashes, or even Korean characters (in
> > one case) in them. These files need cleaning up.
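A pre-load filter along the following lines might save the delete-and-reload
cycle described above. It is a heuristic sketch, not a real N-Triples parser:
the regular expressions and the names `line_is_clean` and `split_lines` are my
own, and it simply rejects any line whose IRIs contain spaces, backslashes,
non-ASCII characters, or a few other characters that may not appear raw in an
IRI.

```python
import re

# Heuristic: characters that should not appear raw inside <...> in N-Triples.
# Covers the cases reported above (spaces, backslashes, non-ASCII text).
BAD_IRI_CHARS = re.compile(r'[\s\\"<>{}|^`]|[^\x00-\x7f]')

# Subject IRI, predicate IRI, then the object up to the closing " ."
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def line_is_clean(line: str) -> bool:
    """Return True if every IRI on the line passes the heuristic."""
    m = TRIPLE.match(line.strip())
    if m is None:
        # Not shaped like "<s> <p> o ." (blank-node subjects land here too).
        return False
    iris = [m.group(1), m.group(2)]
    obj = m.group(3)
    if obj.startswith('<') and obj.endswith('>'):
        iris.append(obj[1:-1])  # object is an IRI rather than a literal
    return not any(BAD_IRI_CHARS.search(iri) for iri in iris)


def split_lines(lines):
    """Partition lines into (clean, rejected), skipping blanks and comments."""
    clean, rejected = [], []
    for line in lines:
        if not line.strip() or line.lstrip().startswith('#'):
            continue
        (clean if line_is_clean(line) else rejected).append(line)
    return clean, rejected
```

Writing the rejected lines to a separate file keeps a record of exactly which
triples were dropped, so they can be reported or repaired later.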
> >
> >
> >
> >
> > Some questions
> > ---------------------------------------
> > * Why does short-abstracts take 4 hours to load though it is 982 MB,
> > whereas long-abstracts took 2 hours to load though its size is 1.7 GB?
> > The only difference is that short was loaded a few files after long.
> > Does performance change as the database file (the one I am creating,
> > dbpedia.db) grows larger?
> >
> > * What is the best way to check for and delete duplicate triples in the
> > database?
> >
> > * Related to this last question, the online DBpedia gateway at
> > dbpedia.org/sparql does not seem to return duplicates through the web
> > page interface. However, it does return duplicates for the SAME query
> > when submitted through Jena. To reproduce this, paste the following
> > query into the web page:
> >
> > select ?s
> > where {
> > ?s
> > <http://dbpedia.org/property/influenced>
> > <http://dbpedia.org/resource/Chris_Rock>
> > }
> >
> > This will return the following results in my web browser:
> > http://dbpedia.org/resource/Bill_Cosby
> > http://dbpedia.org/resource/Dick_Gregory
> > http://dbpedia.org/resource/Eddie_Murphy
> > http://dbpedia.org/resource/Flip_Wilson
> > http://dbpedia.org/resource/George_Carlin
> > http://dbpedia.org/resource/Mort_Sahl
> > http://dbpedia.org/resource/Redd_Foxx
> > http://dbpedia.org/resource/Richard_Pryor
> > http://dbpedia.org/resource/Rodney_Dangerfield
> > http://dbpedia.org/resource/Sam_Kinison
> > http://dbpedia.org/resource/Steve_Martin
> >
> >
> > No duplicates.
> > Now run the *same* query through a Jena program.
> > In my Java source, here is how I am connecting to what I assume is the
> > SAME gateway:
> > QueryExecution qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
> >
> > And here is what I get (again, this is the exact same query):
> >
> > ----------------------------------------------------
> > | s |
> > ====================================================
> > | <http://dbpedia.org/resource/Bill_Cosby> |
> > | <http://dbpedia.org/resource/Dick_Gregory> |
> > | <http://dbpedia.org/resource/Eddie_Murphy> |
> > | <http://dbpedia.org/resource/Flip_Wilson> |
> > | <http://dbpedia.org/resource/George_Carlin> |
> > | <http://dbpedia.org/resource/Mort_Sahl> |
> > | <http://dbpedia.org/resource/Redd_Foxx> |
> > | <http://dbpedia.org/resource/Richard_Pryor> |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison> |
> > | <http://dbpedia.org/resource/Steve_Martin> |
> > | <http://dbpedia.org/resource/Bill_Cosby> |
> > | <http://dbpedia.org/resource/Bill_Cosby> |
> > | <http://dbpedia.org/resource/Dick_Gregory> |
> > | <http://dbpedia.org/resource/Eddie_Murphy> |
> > | <http://dbpedia.org/resource/Flip_Wilson> |
> > | <http://dbpedia.org/resource/George_Carlin> |
> > | <http://dbpedia.org/resource/Mort_Sahl> |
> > | <http://dbpedia.org/resource/Redd_Foxx> |
> > | <http://dbpedia.org/resource/Richard_Pryor> |
> > | <http://dbpedia.org/resource/Rodney_Dangerfield> |
> > | <http://dbpedia.org/resource/Sam_Kinison> |
> > | <http://dbpedia.org/resource/Steve_Martin> |
> > | <http://dbpedia.org/resource/Eddie_Murphy> |
> > ----------------------------------------------------
> >
> > Duplicates!
> > Can someone please explain this?
> >
> > As an aside, when I run this from isql on my newly installed local
> > DBpedia I get no duplicates (I haven't tried Jena with my local copy).
> >
> >
> > <eom>
> >
> >
> >
> >
> >
> > -------------------------------------------------------------------------
> > This SF.Net email is sponsored by the Moblin Your Move Developer's
> challenge
> > Build the coolest Linux based applications with Moblin SDK & win great
> prizes
> > Grand prize is a trip for two to an Open Source event anywhere in the
> world
> > http://moblin-contest.org/redirect.php?banner_id=100&url=/
> > _______________________________________________
> > Dbpedia-discussion mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
> >
> >
> Marvin,
>
> You will see why when you run:
>
> select *
> where {graph ?g {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
>
> As you can see, there are two graphs:
> 1. http://dbpedia.org
> 2. http://dbpedia.org/resource/<entity> (this one results from cache
> activity associated with client interactions with Virtuoso)
>
> Solutions:
> -- Being specific about source Graph by specifying Graph IRI
> select ?s
> where {graph <http://dbpedia.org> {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }}
>
> OR
>
> select ?s
> from <http://dbpedia.org>
> where {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
> -- Using DISTINCT
>
> select distinct ?s
> where {
> ?s
> <http://dbpedia.org/property/influenced>
> <http://dbpedia.org/resource/Chris_Rock>
> }
>
What instruction should be given to Jena or other clients to make them
behave the same way as the HTTP SPARQL page interface, so that they do not
resolve triples from the cache graphs?
Cheers,
Peter
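
One possibility, offered as a guess rather than a confirmed explanation: the
SPARQL protocol lets a client name the default graph with a
`default-graph-uri` request parameter. If the web form sends
`http://dbpedia.org` in that parameter (an assumption on my part) while a
bare client request sends nothing, that would account for the difference.
The sketch below only builds the request URL; the helper name
`build_request_url` is hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# The query from this thread, without FROM, GRAPH, or DISTINCT.
QUERY = (
    "select ?s where { ?s "
    "<http://dbpedia.org/property/influenced> "
    "<http://dbpedia.org/resource/Chris_Rock> }"
)


def build_request_url(endpoint, query, default_graph=None):
    """Build a SPARQL protocol GET URL, optionally naming a default graph."""
    params = [("query", query)]
    if default_graph is not None:
        # Standard SPARQL protocol parameter for the default graph.
        params.append(("default-graph-uri", default_graph))
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url(ENDPOINT, QUERY))
    print(build_request_url(ENDPOINT, QUERY, "http://dbpedia.org"))
```

On the Jena side, I believe QueryExecutionFactory.sparqlService has an
overload that takes a default graph URI as a third argument, but I have not
verified that here. Adding the FROM clause from Kingsley's reply to the query
text should have the same effect with any client.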