Re: [MarkLogic Dev General] General Digest, Vol 140, Issue 56

RAVINDER MAAN Mon, 22 Feb 2016 05:17:45 -0800

Hi Geert

Thanks for the reply. I think I have found the reason, and what I found is
very interesting. To rule out other factors that could affect performance,
I created a single-forest database on a separate machine and inserted
9 million documents into it using the code below.


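xquery version "1.0-ml";

(: Insert 9 million test documents, each in its own xdmp:eval call.
   Element "a" draws from about 2 million values, element "b" from about 40. :)
for $i in (1 to 9000000)
return
  xdmp:eval('
    xdmp:document-insert("/event/' || $i || '",
      <event>
        <a>AB.CDEF/2001XX000729-{xdmp:random(2000000)}</a>
        <b>AB.CDEF/2001XX000729-{xdmp:random(40)}</b>
      </event>)
  ')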

I created two range indexes, one on element "a" and another on element "b".
The code above generates documents in which element "a" has a huge range of
values (about 2 million possible values), while element "b" has only about
40. If I run an element-values query ordered by frequency against element
"a", it is very slow compared with the same query against element "b". For
element "a" I see a response time of 3 seconds even when I run the query
again and again, whereas for element "b" the response time is 220
milliseconds. Based on this I looked into how indexes work in
Elasticsearch, and it seems that in Elasticsearch indexes are sharded based
on range.
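
For reference, the comparison described above can be sketched roughly as
follows. This is only a minimal sketch, assuming string element range
indexes on "a" and "b" in no namespace with the default collation. The
wrapper element names "top" and "value" are made up for illustration and
are not part of any API.

```xquery
xquery version "1.0-ml";

(: Top 10 values by frequency for each test element.
   Assumes an element range index exists on "a" and on "b". :)
for $name in ("a", "b")
let $values :=
  cts:element-values(fn:QName("", $name), (), ("frequency-order", "limit=10"))
return
  (: "top" and "value" are hypothetical wrapper elements :)
  <top element="{$name}">{
    for $v in $values
    return <value count="{cts:frequency($v)}">{$v}</value>
  }</top>
```

To compare timings, each element could be run on its own in a separate
request, since a single request only reports one overall elapsed time.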

I think the next test will be to try the range-based assignment policy, so
that each forest holds only a small subset of the range index values.





Thanks & regards,
Ravinder Singh Maan

On Sun, Feb 21, 2016 at 3:39 PM, <[email protected]>
wrote:

> Send General mailing list submissions to
>         [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://developer.marklogic.com/mailman/listinfo/general
> or, via email, send a message with subject or body 'help' to
>         [email protected]
>
> You can reach the person managing the list at
>         [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of General digest..."
>
>
> Today's Topics:
>
>    1. Re: General Digest, Vol 140, Issue 54 (Geert Josten)
>    2. Re: General Digest, Vol 140, Issue 54 (Rob Szkutak)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 21 Feb 2016 11:42:24 +0000
> From: Geert Josten <[email protected]>
> Subject: Re: [MarkLogic Dev General] General Digest, Vol 140, Issue 54
> To: MarkLogic Developer Discussion <[email protected]>
> Message-ID: <d2ef5eaa.c456f%[email protected]>
> Content-Type: text/plain; charset="windows-1252"
>
> Hi Ravinder,
>
> Thanks for the info. So you have 12 physical cores in total, and an equal
> number of forests. That should mean you have roughly 12 mln docs per
> forest. That should be a nice number for fast faceting, and getting value
> frequencies.
>
> I am rather surprised about the 30 seconds though, and especially because
> the above sounds right. I ran a little comparison on an average demo server
> over here, with a single forest containing 16 mln docs. I restarted the
> server to make sure the caches are cold, and then ran the same code as you,
> only for a slightly different element index. It returned in 0.06 sec, which
> is kind of the order of magnitude I'd typically expect from MarkLogic.
> Using a cluster shouldn't add much more, regardless of the number of nodes
> or forests. Are the numbers consistent if you rerun your test?
>
> You should always be able to get sub-sec results for this. And because
> that is clearly not happening, something else must be causing issues here.
> High latency for instance, or maybe your indexes are taking more memory than
> MarkLogic is getting, meaning it could be swapping or such. How much free
> memory is available on the three nodes, and how fast is the network
> connection between them? Also, is anything else competing for cpu, memory,
> or network bandwidth perhaps?
>
> Cheers,
> Geert
>
> From: <[email protected]> on behalf of RAVINDER MAAN <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Saturday, February 20, 2016 at 11:08 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: [MarkLogic Dev General] General Digest, Vol 140, Issue 54
>
> Hi Geert
>
> Thanks for the reply. In ML it takes about 30 seconds and in Elasticsearch
> it takes 4 seconds. It is a cluster of 3 nodes. Each node has 16GB RAM, and
> "ls /proc/cpuinfo" shows 8 cores (I think that is because of
> hyper-threading, and the actual core count is 4). I have configured 4
> forests per node. Do you think increasing or decreasing the number of
> forests will help? As this is a range index query, I assume the entire
> index is in memory, so other cache settings should not affect this query.
>
> If I run the query with query meters, I see only the cache counts below;
> all other cache hit/miss counts are 0.
>
> <qm:value-cache-misses>194</qm:value-cache-misses>
> <qm:regexp-cache-hits>181</qm:regexp-cache-hits>
> <qm:regexp-cache-misses>5</qm:regexp-cache-misses>
>
>
> Thanks & regards,
> Ravinder Singh Maan
>
> On Sat, Feb 20, 2016 at 7:33 PM, <[email protected]> wrote:
>
>
> Today's Topics:
>
>    1. Re: Best way to find most occuring word or sort by frequency
>       (Geert Josten)
>    2. Re: [1.0-ml] XDMP-TRPLIDXNOTFOUND: cts:triples() -- Triple
>       index not enabled (Geert Josten)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 20 Feb 2016 18:44:50 +0000
> From: Geert Josten <[email protected]>
> Subject: Re: [MarkLogic Dev General] Best way to find most occuring
>         word or sort by frequency
> To: MarkLogic Developer Discussion <[email protected]>
> Message-ID: <d2ee7209.c3d73%[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> I think this is the right approach.
>
> When you say it is slow, how slow is that exactly? And how did you
> configure MarkLogic? More specifically, how many forests do you have? Also,
> how much memory and how many CPU cores do you have?
>
> Kind regards,
> Geert
>
>
> From: <[email protected]> on behalf of RAVINDER MAAN <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Saturday, February 20, 2016 at 11:34 AM
> To: "[email protected]" <[email protected]>
> Subject: [MarkLogic Dev General] Best way to find most occuring word or
> sort by frequency
>
> Hello all
>
> I want to sort element values by frequency. I have tried the following:
>
> for $word in cts:element-values(xs:QName("ELEMENT_NAME"),  (),
> ("frequency-order", "limit=10"))
> return <word count="{cts:frequency($word)}">{$word}</word>
>
>
> But for a very large index this is slow in comparison to Elasticsearch. I
> did this comparison on the same machine with the same data, and of course
> only one of them was running at a time. There are about 250 million
> documents, and frequencies range from 1 million down to hundreds, i.e. if
> I run the above query the word at the top has a count of 1000000.
>
> Is there any other way of doing the same?
>
>
> Thanks & regards,
> Ravinder Singh Maan
>
> ------------------------------
>
> Message: 2
> Date: Sat, 20 Feb 2016 19:33:41 +0000
> From: Geert Josten <[email protected]>
> Subject: Re: [MarkLogic Dev General] [1.0-ml] XDMP-TRPLIDXNOTFOUND:
>         cts:triples() -- Triple index not enabled
> To: MarkLogic Developer Discussion <[email protected]>
> Message-ID: <d2ee7d69.c3ddc%[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Gaël,
>
> You need to enable the triple-index. You can do that by going to the Admin
> UI of your MarkLogic installation, navigating to the relevant content
> database, and toggling the triple index from false to true there. It should
> be around the 10th edit option, so close to the top. Confirm the change by
> clicking OK at the top or bottom of the page, and then wait for the reindex
> to complete. You can follow the progress on the Status tab of that
> database. Refresh it once in a while to get it updated.
>
> Kind regards,
> Geert
>
> From: <[email protected]> on behalf of Gaël YIMEN YIMGA <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Saturday, February 20, 2016 at 5:46 PM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: [MarkLogic Dev General] [1.0-ml] XDMP-TRPLIDXNOTFOUND:
> cts:triples() -- Triple index not enabled
>
> Hello All,
>
> I'm facing an issue in MarkLogic.
> I successfully ran the following query:
> ===================
> import module namespace sem = "http://marklogic.com/semantics"
>       at "/MarkLogic/semantics.xqy";
>
> sem:rdf-insert(
>   (
>   sem:triple(
>     sem:iri("http://example.org/marklogic/people/John_Smith"),
>     sem:iri("http://example.org/marklogic/predicate/livesIn"),
>     "London"
>     )
>   ,
>   sem:triple(
>     sem:iri("http://example.org/marklogic/people/Jane_Smith"),
>     sem:iri("http://example.org/marklogic/predicate/livesIn"),
>     "London"
>     )
>   ,
>   sem:triple(
>     sem:iri("http://example.org/marklogic/people/Jack_Smith"),
>     sem:iri("http://example.org/marklogic/predicate/livesIn"),
>     "Glasgow"
>     )
>   )
> )
> ===================
>
> But as a second step, I ran the following to count the number of triples:
> =======
> xquery version "1.0-ml";
> declare namespace html = "http://www.w3.org/1999/xhtml";
> fn:count(cts:triples());
> =======
> I got the error shown in the image below.
>
> [Inline image 1]
>
> I would be grateful for any help fixing this.
>
> Thanks in advance !!!
>
> Gaël.
> --
>
> ------------------------------
>
> _______________________________________________
> General mailing list
> [email protected]<mailto:[email protected]>
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
>
>
> End of General Digest, Vol 140, Issue 54
> ****************************************
>
> ------------------------------
>
> Message: 2
> Date: Sun, 21 Feb 2016 15:39:40 +0000
> From: Rob Szkutak <[email protected]>
> Subject: Re: [MarkLogic Dev General] General Digest, Vol 140, Issue 54
> To: MarkLogic Developer Discussion <[email protected]>
> Message-ID:
>         <
> 6e8e665d710d394a853b6eec145fb7dc16570...@exchg10-be01.marklogic.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Hi Ravinder,
>
> In addition to Geert's excellent suggestions, you should also take a look
> to see if you've configured your swap space correctly:
> https://docs.marklogic.com/guide/installation/intro#id_11335
>
> Best,
> Rob
>
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> [email protected]
> www.marklogic.com<http://www.marklogic.com>
>
> ------------------------------
>
>
> End of General Digest, Vol 140, Issue 56
> ****************************************
>
