Re: [MarkLogic Dev General] Speeding up xquery returning aggregates

Justin Makeig Fri, 23 Sep 2016 13:03:12 -0700

It’s the combinatorial explosion to get to those 38 tuples that’s the problem. 
What do the cardinalities of each of the “columns” (range indexes) look like? 
Is there a way you can reduce those?
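
A rough way to answer the cardinality question is to count the distinct values each range index contributes under the same date constraint. The sketch below assumes element range indexes on Site, Department, LOB and Audit_Date (the names used in the queries quoted further down); the variable names are illustrative only.

```xquery
xquery version "1.0-ml";

(: Assumption: element range indexes exist on Site, Department, LOB
   and Audit_Date, as in the queries quoted in this thread. :)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
let $names := ("Site", "Department", "LOB")
(: Distinct values per "column" among the fragments matching $query. :)
let $counts :=
  for $name in $names
  return
    fn:count(cts:values(cts:element-reference(fn:QName("", $name)), (), (), $query))
return (
  for $name at $i in $names
  return fn:concat($name, ": ", $counts[$i], " distinct values"),
  (: Upper bound on the number of combinations of the three columns. :)
  fn:concat("upper bound on combinations: ", $counts[1] * $counts[2] * $counts[3])
)
```

If the upper bound in the last line is orders of magnitude larger than the 38 tuples actually returned, that would be consistent with the combinatorial-explosion explanation.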


Justin

--
Justin Makeig
Director, Product Management
MarkLogic


> On Sep 23, 2016, at 12:53 PM, Mark Shanks <[email protected]> wrote:
> 
> I've already said it wasn't due to a high number of value-tuples. There are 
> only 38 value-tuples returned in total. Hence, limiting the result to the 
> first 100 [1 to 100] as you suggested is the same as the original query, and 
> the execution time is the same. I ran the code with your modification to 
> confirm this.
> From: [email protected] 
> <[email protected]> on behalf of Rob Szkutak 
> <[email protected]>
> Sent: Saturday, 24 September 2016 2:45:32 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>  
> Hi,
> 
> My assumption as I've written previously would be #3.
> 
> A very simple way to check would be cts:value-tuples(...)[1 to 100]. Adding 
> the [1 to 100] predicate on the end limits the result to no more than the 
> first 100 tuples of your result set. It wouldn't reduce the number of 
> documents that are evaluated. (To prove that, you could also do 
> [100 to 200].) If your theory about #2 is correct, then adding [1 to 100] 
> shouldn't improve performance.
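
A minimal sketch of that check: run the three-facet call, keep only the first 100 tuples with a [1 to 100] predicate, and report the elapsed time. It assumes the same range indexes and date constraint as the queries elsewhere in this thread; compare the reported time against the same request without the predicate.

```xquery
xquery version "1.0-ml";

(: Assumption: same range indexes and date constraint as the quoted queries. :)
let $refs := (
  cts:element-reference(xs:QName("Site")),
  cts:element-reference(xs:QName("Department")),
  cts:element-reference(xs:QName("LOB"))
)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
(: Keep at most the first 100 tuples of the result. :)
let $first-100 := cts:value-tuples($refs, (), $query)[1 to 100]
return (
  fn:count($first-100),   (: number of tuples actually kept :)
  xdmp:elapsed-time()     (: time elapsed since this request started :)
)
```

If the elapsed time barely changes, the size of the returned tuple set is unlikely to be the bottleneck.
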
> 
> Best,
> Rob
> 
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> [email protected]
> www.marklogic.com
> 
> From: [email protected] 
> [[email protected]] on behalf of Mark Shanks 
> [[email protected]]
> Sent: Friday, September 23, 2016 11:36 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
> 
> Yes, many values were fine with a 10,000 document set but slowed down 
> massively when run against several million. To be clear, there are at least 3 
> counts we could be talking about. 1) The total number of documents in the 
> database. 2) The number of documents that the query is restricted to (such as 
> restricting to a certain date range). 3) The total number of value-tuples 
> returned. My experience is that count 2 is driving the slowness (i.e., 
> the total number of value-tuples returned may be the same, but when MarkLogic 
> needs to determine this set over millions of documents rather than a small 
> number, performance degrades more than would be expected based on the number 
> alone, at least compared to the case of returning only 2 facets).
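
The three counts can be measured side by side. The sketch below is one way to do it, assuming the same range indexes and date constraint as the queries in this thread; note that xdmp:estimate answers from the indexes without filtering, so its figures are counts of matching fragments rather than exact document counts.

```xquery
xquery version "1.0-ml";

(: Assumption: same range indexes and date constraint as the quoted queries. :)
let $refs := (
  cts:element-reference(xs:QName("Site")),
  cts:element-reference(xs:QName("Department")),
  cts:element-reference(xs:QName("LOB"))
)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
return (
  (: 1) everything in the database :)
  fn:concat("1) documents in database: ", xdmp:estimate(fn:doc())),
  (: 2) what the query is restricted to :)
  fn:concat("2) documents matching the query: ",
            xdmp:estimate(cts:search(fn:doc(), $query))),
  (: 3) distinct Site/Department/LOB combinations that come back :)
  fn:concat("3) value-tuples returned: ",
            fn:count(cts:value-tuples($refs, (), $query)))
)
```
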
> 
> I'm still unclear about what is going on under the hood in MarkLogic. The 
> following link (https://docs.marklogic.com/guide/search-dev/lexicon) talks 
> about value co-occurrence lexicons. If this is built, then 2 facets could 
> just refer to this and would result in the extremely fast performance 
> observed. On the other hand, 3 or more facets would not have a pre-prepared 
> lexicon to query. The documentation isn't clear whether a co-occurrence 
> lexicon is built whenever an index is built, or whether it needs to be 
> specifically configured. The documentation about creating lexicons points you 
> to the " 'Text Indexing' and 'Element/Attribute Range Indexes and Lexicons' 
> chapters of the Administrator's Guide", but these then don't mention 
> co-occurrence lexicons at all. So it isn't clear how you actually get a 
> co-occurrence lexicon built.
> 
> Thanks.
> 
> From: [email protected] 
> <[email protected]> on behalf of Rob Szkutak 
> <[email protected]>
> Sent: Friday, 23 September 2016 10:13:41 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>  
> Hi,
> 
> I thought in your earlier email you implied that many values were fine with a 
> 10,000 document set but slowed down when run against several million? This 
> led me to believe the slowdown is caused by returning too many tuples.
> 
> A simple test to confirm whether it's a problem with the size of the result 
> set would be to limit the size of the result set and see if your performance 
> improves.
> 
> Best,
> Rob
> 
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> [email protected]
> www.marklogic.com
> 
> From: [email protected] 
> [[email protected]] on behalf of Mark Shanks 
> [[email protected]]
> Sent: Thursday, September 22, 2016 7:02 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
> 
> Thanks. The point is that the execution time isn't increasing at an 
> exponential rate. Note also that each of the facets had about the same number 
> of entries, so it isn't as if the number of tuples increased from, e.g., 50 
> to 4 million. I find it interesting that MarkLogic has a separate function, 
> cts:value-co-occurrences, for looking at effectively 2 facets. It seems that 
> 2 facets may be cached in some way, or that some shortcut is provided for 
> their computation, whereas more than 2 go a longer way that requires much 
> more processing than either 1 or 2.
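
For reference, the two-facet case written with cts:value-co-occurrences might look like the sketch below. It assumes the same Site and Department range indexes and date constraint as the other queries in this thread, and it neither confirms nor refutes the caching hypothesis; it only shows the call so its timing can be compared with the two-reference cts:value-tuples version.

```xquery
xquery version "1.0-ml";

(: Assumption: element range indexes on Site, Department and Audit_Date. :)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
(: Each result is a cts:co-occurrence element holding two cts:value children. :)
for $pair in
  cts:value-co-occurrences(
    cts:element-reference(xs:QName("Site")),
    cts:element-reference(xs:QName("Department")),
    (),
    $query)
return
  fn:string-join((
    for $v in $pair/cts:value return fn:string($v),
    fn:string(cts:frequency($pair))
  ), "|")
```
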
> From: [email protected] 
> <[email protected]> on behalf of Rob Szkutak 
> <[email protected]>
> Sent: Friday, 23 September 2016 4:58:01 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>  
> Hi,
> 
> As you add more values to the value-tuples call, you will typically 
> exponentially increase the number of results you receive. The total number of 
> results will be the total number of all possible unique combinations of all 
> values. More values means more unique combinations of all values.
> 
> Suppose your code had:
> 
> for $each in $tuples
> return
> fn:concat()
> 
> If you have 4 million documents, you could be returning 4 million tuples at 
> most or easily returning some other number of tuples in the millions.
> 
> If you wrote code on any platform that did something like "for each tuple in 
> a set of millions, do something", you would expect that processing to take 
> some time.
> 
> So, what are your options?
> 
> 1) You could order your tuples by the most (or least) common ones and then 
> paginate the results, returning a much smaller number for each page.
> 
> 2) You could cache the information as data is ingested into a document and 
> then pull that document instead of doing all the work to figure it out on the 
> fly.
> 
> 3) You could investigate upgrading your hardware and see if that helps the 
> processing complete more quickly.
> 
> I would personally recommend #1. If you're getting back a large number of 
> results, you'll absolutely find #1 to be the most navigable alternative.
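
Option 1 could be sketched as follows: ask for the tuples in descending frequency order and return one page at a time. The $page and $page-size variables are hypothetical names introduced for illustration, and the range indexes and date constraint are assumed to match the other queries in this thread.

```xquery
xquery version "1.0-ml";

(: Hypothetical paging parameters, not from the original thread. :)
let $page-size := 20
let $page := 1
let $start := ($page - 1) * $page-size + 1

(: Assumption: same range indexes and date constraint as the quoted queries. :)
let $refs := (
  cts:element-reference(xs:QName("Site")),
  cts:element-reference(xs:QName("Department")),
  cts:element-reference(xs:QName("LOB"))
)
let $query :=
  cts:and-query((
    cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
    cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
  ))
(: Most common combinations first. :)
let $tuples := cts:value-tuples($refs, ("frequency-order", "descending"), $query)
for $tuple in $tuples[$start to $start + $page-size - 1]
return
  fn:string-join((
    (: each tuple is a json:array; unpack it to get the individual values :)
    for $v in json:array-values($tuple) return fn:string($v),
    fn:string(cts:frequency($tuple))
  ), "|")
```

Paging only trims what is returned and formatted; whether it also shortens the lexicon work would need to be measured.
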
> 
> Best,
> Rob
> 
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> [email protected]
> www.marklogic.com
> 
> From: [email protected] 
> [[email protected]] on behalf of Mark Shanks 
> [[email protected]]
> Sent: Thursday, September 22, 2016 1:32 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
> 
> As a follow-up, we found that the query was super fast with a small dataset 
> (e.g., 10,000 records). On the other hand, with a large dataset (40 million, 
> and pulling around 1 million records), we found that the query would be 
> super fast with 1 or 2 facets, e.g.:
> 
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
>     ))
>   )
> 
> or 
> 
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
>     ))
>   )
> 
> but would take a massive performance hit once the facets are increased to 3, 
> and 4 was much slower again. E.g.:
> 
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department")),
>       cts:element-reference(xs:QName("LOB"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01"))
>     ))
>   )
> 
> By performance hit, I mean the first two queries would take 1 second each. 
> Pulling 3 facets would take 250 seconds, and pulling 4 facets would take 350 
> seconds. Anyone have any idea of what is going on under the hood to lead to 
> such a breaking point between 1-2 facets and more facets? Any better way to 
> do the query in such circumstances to avoid the performance hit?
> 
> Thanks.
> From: [email protected] 
> <[email protected]> on behalf of Mark Shanks 
> <[email protected]>
> Sent: Wednesday, 21 September 2016 4:35:32 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>  
> Hi Rob,
> 
> Your suggestion worked very well! Super fast, at least with the relatively 
> small dataset I'm using at present.
> 
> Thanks.
> From: [email protected] 
> <[email protected]> on behalf of Rob Szkutak 
> <[email protected]>
> Sent: Saturday, 17 September 2016 7:28:01 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates
>  
> Hi,
> 
> The fastest way to do that I can think of would be to create range indexes 
> on /Data/Site, /Data/Department, /Data/LOB, and /Data/Audit_Date.
> 
> Next, you could use cts:value-tuples() to build your result set directly out 
> of the in-memory indexes without needing to pull document fragments. 
> Finally, you would just need to return your concatenation.
> 
> It would look something like this (not tested):
> 
> let $tuples :=
>   cts:value-tuples(
>     (
>       cts:element-reference(xs:QName("Site")),
>       cts:element-reference(xs:QName("Department")),
>       cts:element-reference(xs:QName("LOB"))
>     ),
>     (),
>     cts:and-query((
>       cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")),
>       cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")),
>       cts:or-query((
>         cts:element-value-query(xs:QName("Classification"), "Finding"),
>         cts:element-value-query(xs:QName("Classification"), "Observation")
>       ))
>     ))
>   )
> 
> for $each in $tuples
> (: each tuple is a json:array; unpack it before indexing into it :)
> let $values := json:array-values($each)
> return
>   fn:concat($values[1], "|", $values[2], "|", $values[3], "|", cts:frequency($each))
> 
> Best,
> Rob
> 
> Rob Szkutak
> Senior Consultant
> MarkLogic Corporation
> [email protected]
> www.marklogic.com
> 
> From: [email protected] 
> [[email protected]] on behalf of Mark Shanks 
> [[email protected]]
> Sent: Friday, September 16, 2016 3:55 PM
> To: 'General MarkLogic Developer Discussion'
> Subject: [MarkLogic Dev General] Speeding up xquery returning aggregates
> 
> Hi,
> 
> I'm trying to find the best way to return the results of what would be the 
> following equivalent sql statement:
> 
> select Site, Department, LOB, count(*) from Data 
> where Audit_Date > '2010-01-01' and Audit_Date < '2011-01-01' and 
> (Classification = 'Finding' or Classification = 'Observation')
> group by Site, Department, LOB
> 
> I didn't test this sql statement, but it should give you the idea... Anyway, 
> I came up with the following xquery equivalent:
> 
> for $s in distinct-values(/Data/Site)
> return
>   for $d in distinct-values(/Data/Department)
>   return
>     for $lob in distinct-values(/Data/LOB)
>     return concat($s, '|', $d, '|', $lob, '|',
>       count(
>         for $x in (/Data[Site=$s and Department=$d and LOB=$lob and
>           (Classification='Finding' or Classification='Observation')])
>         let $date as xs:dateTime := $x/Audit_Date
>         where $date gt xs:dateTime("2010-01-01T00:00:00")
>           and $date lt xs:dateTime("2011-01-01T00:00:00")
>         return ($x)
>       )
>     )
> 
> It works fine and is not super-slow, but isn't particularly fast either. Is 
> this the most efficient way to get this type of information out of marklogic? 
> Assuming the fields are indexed, would some search command be faster? Or 
> maybe subset the data better? 
> 
> Thanks,
> 
> Mark
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at: 
> http://developer.marklogic.com/mailman/listinfo/general





