The cts:triples() call is de-duplicating triples as it brings them back to the E-node. Other methods are faster because they avoid the de-duplication and can therefore often perform operations on D-nodes rather than centrally on the E-node.
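An approximate count can sidestep the de-duplication by sampling, along the lines already suggested in the thread quoted below. What follows is only a minimal sketch, not a supported recipe. It assumes the triples are stored as sem:triple elements inside documents and that a random sample of documents is representative. The sample size of 100 and the variable names are illustrative and not taken from the thread.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

(: Assumption: triples are stored as sem:triple elements inside documents. :)
(: The sample size is an illustrative value, not a recommendation. :)
let $sample-size := 100

(: Matches any document that contains at least one sem:triple element. :)
let $has-triples :=
  cts:element-query(xs:QName("sem:triple"), cts:and-query(()))

(: "score-random" assigns random scores, so the first N results
   form a random sample of the matching documents. :)
let $sample := fn:subsequence(
  cts:search(fn:doc(), $has-triples, "score-random"), 1, $sample-size)

(: Mean number of triples per sampled document. :)
let $mean := fn:avg(for $doc in $sample return fn:count($doc//sem:triple))

(: Index-resolved estimate of how many documents contain triples. :)
let $doc-count := xdmp:estimate(cts:search(fn:doc(), $has-triples))

return fn:round($mean * $doc-count)
```

Like counting sem:triple elements directly, this counts every stored triple, duplicates included, so it can come out higher than a de-duplicated count from cts:triples() or SPARQL.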
We hope to add a function that estimates the count of triples in the database in future.

John

On 24/02/14 10:02, David Ennis wrote:
> Hi Anthony.
>
> Thanks for the feedback. It is possible, but unlikely, that I have 14%
> difference. Keep in mind, the 14% difference is not with
> fn:count(cts:triples()) vs counting sem:triple because
> fn:count(cts:triples()) never finishes for me.
>
> The 14% difference is between counting the triples and estimating the
> count (number of triples in doc 1 times the number of docs that include
> triples). This estimate is higher and is likely due to not all docs
> having the same number of triples.
>
> The challenge is: get an accurate count of the number of triples in a
> database without having to add an index by hand. I cannot seem to get
> that count via cts:triples or SPARQL.
>
> Kind Regards,
> David
>
> On 24/02/14 10:42, Anthony Coates wrote:
>> Classification: For internal use only
>>
>> David, are all of your triples unique? A SPARQL-type count of triples
>> (which should include cts:triples) would only return the number of
>> unique triples, whereas a CTS search on sem:triple would include
>> redundant triples in the count (I would assume). I'm wondering if
>> that could be a partial reason for the 14% difference you see, unless
>> you know that all of your triples are unique.
>>
>> Cheers, Tony.
>>
>> Anthony Coates
>> VP | Solution/Data Architect
>> Deutsche Bank AG, Filiale London
>>
>> From: David Ennis <[email protected]>
>> To: MarkLogic Developer Discussion <[email protected]>
>> Date: 22/02/2014 16:28
>> Subject: Re: [MarkLogic Dev General] Count number of triples
>>
>> ------------------------------------------------------------------------
>>
>> Hi Mike.
>>
>> Thanks for the reply. Yeah, was surprised that cts:triples was not
>> as efficient as I had hoped. Adding the index works, but just feels odd.
>>
>> I really don't have a true use-case for needing the number - just in
>> testing/developing I found it odd that I could not get the answer.
>>
>> Estimating via a sample is OK - now that I know the true number. I
>> had been estimating using the count of sem:triple on a random set of
>> 100 docs and the end always ended up at about 48 million - off by 14%
>> - but on these biggish numbers, it still gives me a ballpark figure.
>>
>> Regards,
>> David
>>
>> David Ennis
>> Content Engineer
>>
>> On 22 February 2014 16:30, Michael Blakeley <[email protected]> wrote:
>> It seems like this should be possible in SPARQL, but I think
>> SPARQL doesn't have COUNT yet? When that's implemented it might also make
>> sense to add an XQuery accessor, maybe something like cts:remainder.
>> Another approach might be to make xdmp:estimate accurate for triples.
>>
>> The fact that count(doc()//sem:triple) is faster than
>> count(cts:triples()) may be a bug, or at least a missing optimization.
>> If it's an important use-case for you, contact support.
>>
>> If you don't mind a little imprecision you can sample. This assumes
>> the count of triples in the first triple document is representative of
>> the rest of the database.
>>
>> count((//sem:triple)[1]/root()//sem:triple)
>> * xdmp:estimate(//sem:triple)
>>
>> Of course you could sample more documents rather than just the first
>> one, and adjust accordingly.
>>
>> -- Mike
>>
>> On 21 Feb 2014, at 23:04, David Ennis <[email protected]> wrote:
>>
>> > Howdy.
>> >
>> > In trying to learn the details of the Triple Store in MarkLogic, I
>> > decided to keep kicking it until it dies. To really stress it, I am
>> > using a 1 CPU setup with 2 gig of memory and have loaded in ~42
>> > million triples. It grumbled a bit in the process, but succeeded and
>> > the graph endpoint on the rest interface is happy enough for some
>> > testing.
>> >
>> > But... I am stumped... How can I get the count of all of my triples?
>> >
>> > Documentation suggests fn:count( cts:triples() ) - but that is
>> > unrealistic when you have any real volume.
>> >
>> > After some thoughts, I came up with this silly approach:
>> >
>> > - Added range index on sem:triples
>> >
>> > With this, I get OK results (considering hardware) when counting in
>> > the following ways:
>> > - cts:count-aggregate(cts:element-reference(xs:QName("sem:triple")))
>> > - fn:count(doc()//sem:triple)
>> >
>> > This seems like a viable approach - because you can still play with
>> > the triples like they are any other document so I am getting the
>> > benefit of the index. But... for this I added an index just for this
>> > purpose, which seems a bit silly.
>> >
>> > OK, maybe in production the question of how many triples I have is
>> > irrelevant, but for testing, it would be a nice thing to know.
>> >
>> > Does anyone else have any idea how to get a count of the number of
>> > triples in a system?
>> >
>> > Regards,
>> > David
>> >
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
>
> --
> David Ennis
> Content Engineer
> HintTech <http://www.hinttech.com>

--
John Snelson, Lead Engineer
http://twitter.com/jpcs
MarkLogic Corporation
http://www.marklogic.com
