Hi Anthony,
Thanks for the feedback. It is possible, but unlikely, that duplicate
triples account for a 14% difference. Keep in mind that the 14%
difference is not between fn:count(cts:triples()) and counting
sem:triple, because fn:count(cts:triples()) never finishes for me.
The 14% difference is between counting the triples and estimating
the count (the number of triples in doc 1 times the number of docs
that include triples). The estimate is higher, most likely because
not all docs have the same number of triples.
The challenge is to get an accurate count of the number of triples
in a database without having to add an index by hand. I cannot
seem to get that count via cts:triples or SPARQL.
Kind Regards,
David
On 24/02/14 10:42, Anthony Coates wrote:
David, are all of your triples unique? A SPARQL-type count of triples
(which should include cts:triples) would only return the number of
unique triples, whereas a CTS search on sem:triple would include
redundant triples in the count (I would assume). I'm wondering if that
could be a partial reason for the 14% difference you see, unless you
know that all of your triples are unique.
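One rough way to test for redundant triples on a sample is sketched below. It is not from the thread. It assumes the triples are stored as sem:triple elements with sem:subject, sem:predicate and sem:object children in the http://marklogic.com/semantics namespace, and the sample size of 100 documents is an arbitrary, hypothetical choice. It only detects duplicates inside the sample, so at best it gives a lower bound on redundancy, not a database-wide answer.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

(: Hypothetical sample size; adjust to what the hardware tolerates. :)
declare variable $SAMPLE-SIZE := 100;

(: Match any document that contains at least one sem:triple element. :)
let $query := cts:element-query(xs:QName("sem:triple"), cts:and-query(()))

(: "score-random" returns matches in random order, so the first
   $SAMPLE-SIZE results act as a random sample of triple documents. :)
let $triples :=
  cts:search(fn:doc(), $query, "score-random")[1 to $SAMPLE-SIZE]//sem:triple

(: Build one comparison key per triple from subject, predicate, object,
   plus the object's datatype and language where present. :)
let $keys :=
  for $t in $triples
  return fn:string-join(
    (fn:string($t/sem:subject),
     fn:string($t/sem:predicate),
     fn:string($t/sem:object),
     fn:string($t/sem:object/@datatype),
     fn:string($t/sem:object/@xml:lang)),
    "&#10;")

(: First number: sem:triple elements in the sample.
   Second number: distinct triples among them. :)
return (fn:count($triples), fn:count(fn:distinct-values($keys)))
```

If the two numbers match, the sample holds no duplicates, which would make redundancy an unlikely explanation for a gap as large as 14%.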
Cheers, Tony.
____________________________________________________

Anthony Coates
VP | Solution/Data Architect
Deutsche Bank AG, Filiale London
Global Technology and Operations (GTO)
1 Appold Street, EC2A 2UU London, United Kingdom
Tel. +44(20)754-77217
Mobile +44 7905439026
Hi Mike,
Thanks for the reply. Yeah, I was surprised that cts:triples was not
as efficient as I had hoped. Adding the index works, but it just feels
odd.
I really don't have a true use case for needing the number. While
testing and developing I just found it odd that I could not get the
answer.
Estimating via a sample is OK now that I know the true number. I had
been estimating using the count of sem:triple on a random set of 100
docs, and the estimate always ended up at about 48 million. That is
off by 14%, but on these biggish numbers it still gives me a ballpark
figure.
Regards,
David
David Ennis
Content Engineer
Mastering the value of content
creative | technology | content
Delftechpark 37i
2628 XJ Delft
The Netherlands
On 22 February 2014 16:30, Michael Blakeley wrote:
It seems like this should be possible in SPARQL, but I think SPARQL
doesn't have COUNT yet? When that's implemented it might also make
sense to add an XQuery accessor, maybe something like cts:remainder.
Another approach might be to make xdmp:estimate accurate for triples.
The fact that count(doc()//sem:triple) is faster than
count(cts:triples()) may be a bug, or at least a missing optimization.
If it's an important use-case for you, contact support.
If you don't mind a little imprecision you can sample. This assumes
the count of triples in the first triple document is representative
of the rest of the database.
count((//sem:triple)[1]/root()//sem:triple)
* xdmp:estimate(//sem:triple)
Of course you could sample more documents rather than just the first
one, and adjust accordingly.
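A minimal sketch of that multi-document variant follows. It is an illustration, not something posted in the thread: the sample size of 100 and the use of the "score-random" search option to pick the sample are assumptions, as is the sem namespace declaration. The result is still an estimate, since it multiplies the mean triples per sampled document by the estimated number of documents that contain triples.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

(: Hypothetical sample size; a larger sample costs more but is less noisy. :)
declare variable $SAMPLE-SIZE := 100;

(: Match any document that contains at least one sem:triple element. :)
let $query := cts:element-query(xs:QName("sem:triple"), cts:and-query(()))

(: Random sample of triple-bearing documents. :)
let $docs := cts:search(fn:doc(), $query, "score-random")[1 to $SAMPLE-SIZE]
let $sampled := fn:count($docs)

(: Mean number of triples per sampled document (0 if nothing matched). :)
let $mean :=
  if ($sampled eq 0) then 0
  else fn:count($docs//sem:triple) div $sampled

(: Scale by the index-based estimate of matching documents. :)
return fn:round($mean * xdmp:estimate(cts:search(fn:doc(), $query)))
```

Running it several times and comparing the results gives a feel for how much the estimate moves with the sample.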
-- Mike
On 21 Feb 2014, at 23:04, David Ennis wrote:
> Howdy.
>
> In trying to learn the details of the Triple Store in MarkLogic, I
> decided to keep kicking it until it dies. To really stress it, I am
> using a 1 CPU setup with 2 gig of memory and have loaded in ~42
> million triples. It grumbled a bit in the process, but succeeded, and
> the graph endpoint on the REST interface is happy enough for some
> testing.
>
> But... I am stumped... How can I get the count of all of my triples?
>
> Documentation suggests fn:count( cts:triples() ) - but that is
> unrealistic when you have any real volume.
>
> After some thought, I came up with this silly approach:
>
> - Added range index on sem:triple
>
> With this, I get OK results (considering hardware) when counting in
> the following ways:
> - cts:count-aggregate(cts:element-reference(xs:QName("sem:triple")))
> - fn:count(doc()//sem:triple)
>
> This seems like a viable approach, because you can still play with
> the triples like they are any other document, so I am getting the
> benefit of the index. But for this I added an index just for this
> purpose, which seems a bit silly.
>
> OK, maybe in production the question of how many triples I have is
> irrelevant, but for testing it would be a nice thing to know.
>
> Does anyone else have any idea how to get a count of the number of
> triples in a system?
> Regards,
> David
> David Ennis
> Content Engineer
>
> Mastering the value of content
> creative | technology | content
> Delftechpark 37i
> 2628 XJ Delft
> The Netherlands
> T: +31 88 268 25 00
> M: +31 6 000 000 00
_______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general