Re: [MarkLogic Dev General] Count number of triples [I]

John Snelson Mon, 24 Feb 2014 02:23:47 -0800

The cts:triples() call is de-duplicating triples as it brings them back 
to the E-node. Other methods are faster because they avoid the 
de-duplication and can therefore often perform operations on D-nodes 
rather than centrally on the E-node.
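
To make the cost difference concrete, here is a minimal Python sketch (not
MarkLogic code) of the two kinds of count. The toy documents and the function
names raw_count and distinct_count are hypothetical, made up for illustration.
The point is only that a raw count can be summed per document, whereas a
de-duplicated count has to see all triples together in one place.

```python
# Hypothetical data: each "document" is a list of (subject, predicate, object)
# tuples. The second document repeats one triple from the first.
docs = [
    [("s1", "p", "o1"), ("s2", "p", "o2")],
    [("s1", "p", "o1"), ("s3", "p", "o3")],
]


def raw_count(documents):
    """Count every stored triple; each document is counted independently."""
    return sum(len(doc) for doc in documents)


def distinct_count(documents):
    """Count unique triples; all triples must be gathered into one set first."""
    seen = set()
    for doc in documents:
        seen.update(doc)
    return len(seen)


if __name__ == "__main__":
    print("raw:", raw_count(docs))
    print("distinct:", distinct_count(docs))
```

Because one triple is stored in both toy documents, the two functions return
different numbers, which is the kind of gap Tony asks about in the quoted
message.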


We hope to add a function that estimates the count of triples in the 
database in future.
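
In the meantime, the sampling approach discussed in the quoted messages can be
sketched as follows. This is illustrative Python, not XQuery: estimate_triples
and its parameters are hypothetical names, and the list of per-document triple
counts stands in for whatever a real query would return.

```python
import random


def estimate_triples(triples_per_doc, sample_size, seed=0):
    """Estimate the total triple count from a random sample of documents.

    triples_per_doc: one integer per document (hypothetical input).
    sample_size: how many documents to sample.
    """
    if not triples_per_doc:
        return 0
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = min(sample_size, len(triples_per_doc))
    sample = rng.sample(triples_per_doc, k)
    mean_per_doc = sum(sample) / len(sample)
    # Scale the sample mean up to the whole collection.
    return round(mean_per_doc * len(triples_per_doc))
```

The estimate is exact when every document holds the same number of triples or
when the sample covers every document. Otherwise it drifts with how
unrepresentative the sampled documents are, which matches David's explanation
of his 14% gap.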

John

On 24/02/14 10:02, David Ennis wrote:
> Hi Anthony.
>
> Thanks for the feedback.  It is possible, but unlikely, that I have 14%
> difference.  Keep in mind, the 14% difference is not with
> fn:count(cts:triples()) vs counting sem:triple because
> fn:count(cts:triples()) never finishes for me.
>
> The 14% difference is between counting the triples and estimating the
> count (number of triples in doc 1 times the number of docs that include
> triples). This estimate is higher, most likely because not all docs
> have the same number of triples.
>
> The challenge is: get an accurate count of the number of triples in a
> database without having to add an index by hand.  I cannot seem to get
> that count via cts:triples or SPARQL.
>
> Kind Regards,
> David
>
>
>
> On 24/02/14 10:42, Anthony Coates wrote:
>> David, are all of your triples unique?  A SPARQL-type count of triples
>> (which should include cts:triples) would only return the number of
>> unique triples, whereas a CTS search on sem:triple would include
>> redundant triples in the count (I would assume).  I'm wondering if
>> that could be a partial reason for the 14% difference you see, unless
>> you know that all of your triples are unique.
>>
>> Cheers, Tony.
>>
>> From:        David Ennis <[email protected]>
>> To:  MarkLogic Developer Discussion <[email protected]>,
>> Date:        22/02/2014 16:28
>> Subject:     Re: [MarkLogic Dev General] Count number of triples
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hi Mike.
>>
>> Thanks for the reply.  Yeah, I was surprised that cts:triples was not
>> as efficient as I had hoped. Adding the index works, but just feels odd.
>>
>> I really don't have a true use-case for needing the number - just in
>> testing/developing I found it odd that I could not  get the answer.
>>
>> Estimating via a sample is OK - now that I know the true number.  I
>> had been estimating using the count of sem:triple on a random set of
>> 100 docs and the result always ended up at about 48 million - off by 14%
>> - but on these biggish numbers, it still gives me a ballpark figure.
>>
>> Regards,
>> David
>>
>> On 22 February 2014 16:30, Michael Blakeley <[email protected]_
>> <mailto:[email protected]>> wrote:
>> It seems like this should be possible in SPARQL, but I think the SPARQL
>> implementation doesn't have COUNT yet? When that's implemented it might also make
>> sense to add an XQuery accessor, maybe something like cts:remainder.
>> Another approach might be to make xdmp:estimate accurate for triples.
>>
>> The fact that count(doc()//sem:triple) is faster than
>> count(cts:triples()) may be a bug, or at least a missing optimization.
>> If it's an important use-case for you, contact support.
>>
>> If you don't mind a little imprecision you can sample. This assumes
>> the count of triples in the first triple document is representative of
>> the rest of the database.
>>
>>     count((//sem:triple)[1]/root()//sem:triple)
>>     * xdmp:estimate(//sem:triple)
>>
>> Of course you could sample more documents rather than just the first
>> one, and adjust accordingly.
>>
>> -- Mike
>>
>> On 21 Feb 2014, at 23:04 , David Ennis <[email protected]_
>> <mailto:[email protected]>> wrote:
>>
>> > Howdy.
>> >
>> > In trying to learn the details of the Triple Store in MarkLogic, I
>> > decided to keep kicking it until it dies. To really stress it, I am
>> > using a 1 CPU setup with 2 gig of memory and have loaded in ~42
>> > million triples.  It grumbled a bit in the process, but succeeded and
>> > the graph endpoint on the REST interface is happy enough for some testing.
>> >
>> > But...  I am stumped... How can I get the count of all of my triples?
>> >
>> > Documentation suggests fn:count( cts:triples() ) - but that is
>> > unrealistic when you have any real volume.
>> >
>> > After some thoughts, I came up with this silly approach:
>> >
>> > - Added range index on sem:triple
>> >
>> > With this, I get OK results (considering hardware) when counting in
>> > the following ways:
>> > - cts:count-aggregate(cts:element-reference(xs:QName("sem:triple")))
>> > - fn:count(doc()//sem:triple)
>> >
>> > This seems like a viable approach - because you can still play with
>> > the triples like they are any other document, so I am getting the
>> > benefit of the index. But for this I added an index just for this
>> > purpose, which seems a bit silly.
>> >
>> > OK, maybe in production the question of how many triples I have is
>> > irrelevant, but for testing, it would be a nice thing to know.
>> >
>> > Does anyone else have any idea how to get a count of the number of
>> > triples in a system?
>> >
>> > Regards,
>> > David
>> >
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
>
>


-- 
John Snelson, Lead Engineer                    http://twitter.com/jpcs
MarkLogic Corporation                         http://www.marklogic.com
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
