David, are all of your triples unique?  A SPARQL-type count of triples 
(which should include cts:triples) would only return the number of unique 
triples, whereas a CTS search on sem:triple would include redundant 
triples in the count (I would assume).  I'm wondering if that could be a 
partial reason for the 14% difference you see, unless you know that all of 
your triples are unique.
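
To illustrate the distinction, here is a minimal Python sketch. It assumes, as guessed above, that an element count sees every stored copy of a triple while a SPARQL-style count sees each distinct triple once. The triples and the function name `count_triples` are hypothetical and only model the idea; nothing here calls MarkLogic.

```python
from collections import Counter

# Hypothetical data: each tuple is (subject, predicate, object).
stored_triples = [
    ("ex:a", "ex:knows", "ex:b"),  # stored in one document
    ("ex:a", "ex:knows", "ex:b"),  # the same triple stored again elsewhere
    ("ex:b", "ex:knows", "ex:c"),
]

def count_triples(triples):
    """Return (total occurrences, distinct triples)."""
    occurrences = Counter(triples)
    total = sum(occurrences.values())   # every stored copy is counted
    unique = len(occurrences)           # each distinct triple is counted once
    return total, unique

total, unique = count_triples(stored_triples)
print(total, unique)
```

If the two numbers differ for your data, duplicates would account for at least part of the gap.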

Cheers, Tony.

____________________________________________________

Anthony Coates
VP | Solution/Data Architect
Deutsche Bank AG, Filiale London
Global Technology and Operations (GTO)
1 Appold Street, EC2A 2UU London, United Kingdom
Tel. +44(20)754-77217
Mobile +44 7905439026
Email [email protected]

From: David Ennis <[email protected]>
To: MarkLogic Developer Discussion <[email protected]>
Date: 22/02/2014 16:28
Subject: Re: [MarkLogic Dev General] Count number of triples

Hi Mike.

Thanks for the reply. Yeah, I was surprised that cts:triples was not as 
efficient as I had hoped. Adding the index works, but it just feels odd.

I really don't have a true use case for needing the number - just in 
testing/developing I found it odd that I could not get the answer.

Estimating via a sample is OK, now that I know the true number. I had 
been estimating using the count of sem:triple on a random set of 100 docs, 
and the result always ended up at about 48 million. That is off by 14%, but 
on these biggish numbers it still gives me a ballpark figure.

Regards,
David

David Ennis
Content Engineer
Mastering the value of content
creative | technology | content
Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 6 000 000 00

On 22 February 2014 16:30, Michael Blakeley <[email protected]> wrote:
It seems like this should be possible in SPARQL, but I think the SPARQL 
implementation doesn't have COUNT yet. When that's implemented it might 
also make sense to add an XQuery accessor, maybe something like 
cts:remainder. Another approach might be to make xdmp:estimate accurate 
for triples.

The fact that count(doc()//sem:triple) is faster than count(cts:triples()) 
may be a bug, or at least a missing optimization. If it's an important 
use-case for you, contact support.

If you don't mind a little imprecision you can sample. This assumes the 
count of triples in the first triple document is representative of the 
rest of the database.

    count((//sem:triple)[1]/root()//sem:triple)
    * xdmp:estimate(//sem:triple)

Of course you could sample more documents rather than just the first one, 
and adjust accordingly.
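
To make the "sample more documents" idea concrete, here is a minimal Python sketch of the arithmetic only. It does not talk to MarkLogic: the list of per-document triple counts stands in for the database, and the name `estimate_triple_count` and its parameters are made up for illustration. In a real database the document count would come from something like xdmp:estimate(//sem:triple).

```python
import random

def estimate_triple_count(triples_per_doc, sample_size, seed=0):
    """Estimate the total triple count from a random sample of documents.

    triples_per_doc: list with one integer per triple document
                     (a stand-in for the database, not a MarkLogic API).
    """
    rng = random.Random(seed)
    k = min(sample_size, len(triples_per_doc))
    if k == 0:
        return 0
    sample = rng.sample(triples_per_doc, k)
    mean_per_doc = sum(sample) / k
    # Scale the sample mean by the number of triple documents.
    return round(mean_per_doc * len(triples_per_doc))

# Hypothetical database: 1000 documents holding 100 triples each.
print(estimate_triple_count([100] * 1000, sample_size=10))
```

The estimate is only as good as the sample: if document sizes vary a lot, a small sample can land well away from the true count.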

-- Mike

On 21 Feb 2014, at 23:04, David Ennis <[email protected]> wrote:

> Howdy.
>
> In trying to learn the details of the Triple Store in MarkLogic, I
> decided to keep kicking it until it dies. To really stress it, I am using
> a 1-CPU setup with 2 GB of memory and have loaded in ~42 million triples.
> It grumbled a bit in the process, but succeeded, and the graph endpoint on
> the REST interface is happy enough for some testing.
>
> But... I am stumped. How can I get the count of all of my triples?
>
> The documentation suggests fn:count(cts:triples()), but that is
> unrealistic when you have any real volume.
>
> After some thought, I came up with this silly approach:
>
> - Added a range index on sem:triple
>
> With this, I get OK results (considering the hardware) when counting in
> the following ways:
> - cts:count-aggregate(cts:element-reference(xs:QName("sem:triple")))
> - fn:count(doc()//sem:triple)
>
> This seems like a viable approach, because you can still play with the
> triples like they are any other document, so I am getting the benefit of
> the index. But for this I added an index just for this purpose, which
> seems a bit silly.
>
> OK, maybe in production the question of how many triples I have is
> irrelevant, but for testing it would be a nice thing to know.
>
> Does anyone else have any idea how to get a count of the number of
> triples in a system?
>
> Regards,
> David
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
