Hi, I thought in your earlier email you implied that many values were fine with a 10,000 document set but slowed down when run against several million? This lead me to believe the slowdown is caused by returning too many tuples.
A simple test to confirm if its a problem with the size of the result set would be to limit the size of the result set and see if your performance improves. Best, Rob Rob Szkutak Senior Consultant MarkLogic Corporation rob.szku...@marklogic.com www.marklogic.com<http://www.marklogic.com> ________________________________ From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Mark Shanks [markshanks...@hotmail.com] Sent: Thursday, September 22, 2016 7:02 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates Thanks. The point is that the execution time isn't increasing at an exponential rate. Note also that each of the facets had about the same number of entries, so it isn't as if the number of tuples increased from, e.g., 50 to 4 million. I find it interesting that marklogic has a separate statement cts:value-co-occurrences for looking at effectively 2 facets. Seems maybe that 2 facets are cached in some way or some shortcut is provided for their computation, whereas more than 2 needs to go a longer way that requires much more processing than either 1 or 2. ________________________________ From: general-boun...@developer.marklogic.com <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak <rob.szku...@marklogic.com> Sent: Friday, 23 September 2016 4:58:01 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates Hi, As you add more values to the value-tuples call, you will typically exponentially increase the amount of results you receive. The total number of results will be the total number of all possible unique combinations of all values. More values means more unique combinations of all values. If your code you had : for $each in $tuples return fn:concat() If you have 4 million documents, you could be returning 4 million tuples at most or easily returning some other number of tuples in the millions. If you wrote code in any platform that did something like "for each tuple in a set of millions do something" then you will expect that processing to take some time. So, what are your options? 1) You could order your tuples by the most (or least) common ones and then paginate the results, returning a much smaller number for each page. 2) You could cache the information as data is ingested into a document and then pull that document instead of doing all the work to figure it out on the fly. 3) You could investigate upgrading your hardware and see if that helps the processing complete more quickly. I would personally recommend #1 . If you're getting back a large number of results, you'll absolutely find #1 to be the most navigable alternative. Best, Rob Rob Szkutak Senior Consultant MarkLogic Corporation rob.szku...@marklogic.com www.marklogic.com<http://www.marklogic.com> ________________________________ From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Mark Shanks [markshanks...@hotmail.com] Sent: Thursday, September 22, 2016 1:32 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates As a follow-up, we found that the query was super fast with a small dataset (e.g., 10,000 records). On the other hand, with a large dataset (40 million, and pulling around 1 milllion records), we found that the query would be super fast with 1 or 2 facets, e.g.: let $tuples := cts:value-tuples( ( cts:element-reference(xs:QName("Site")) ), (), cts:and-query(( cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")), cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")) )) ) or let $tuples := cts:value-tuples( ( cts:element-reference(xs:QName("Site")), cts:element-reference(xs:QName("Department")) ), (), cts:and-query(( cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")), cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")) )) ) but would take a massive performance hit once the facets are increased to 3, and 4 was much slower again. E.g.: let $tuples := cts:value-tuples( ( cts:element-reference(xs:QName("Site")), cts:element-reference(xs:QName("Department")), cts:element-reference(xs:QName("LOB")) ), (), cts:and-query(( cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")), cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")) )) ) By performance hit, I mean the first two queries would take 1 second each. Pulling 3 facets would take 250 seconds, and pulling 4 facets would take 350 seconds. Anyone have any idea of what is going on under the hood to lead to such a breaking point between 1-2 facets and more facets? Any better way to do the query in such circumstances to avoid the performance hit? Thanks. ________________________________ From: general-boun...@developer.marklogic.com <general-boun...@developer.marklogic.com> on behalf of Mark Shanks <markshanks...@hotmail.com> Sent: Wednesday, 21 September 2016 4:35:32 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates Hi Rob, Your suggestion worked very well! Super fast, at least with the relatively small dataset I'm using at present. Thanks. ________________________________ From: general-boun...@developer.marklogic.com <general-boun...@developer.marklogic.com> on behalf of Rob Szkutak <rob.szku...@marklogic.com> Sent: Saturday, 17 September 2016 7:28:01 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Speeding up xquery returning aggregates Hi, The fastest way to do that I can think of would be to index Data/Site, Data/Department, Data/LOB, /Data/Audit_Date. Next, you could use cts:value-tuples() to build your result set directly out of the in-memory indexes without needing to pull document fragments . Finally, you would just need to return your concatenation. It would look something like this (Not tested) : let $tuples := cts:value-tuples( ( cts:element-reference(xs:QName("Site")), cts:element-reference(xs:QName("Department")), cts:element-reference(xs:QName("LOB")) ), (), cts:and-query(( cts:element-range-query(xs:QName("Audit_Date"), ">", xs:date("2010-01-01")), cts:element-range-query(xs:QName("Audit_Date"), "<", xs:date("2011-01-01")), cts:or-query(( cts:element-value-query(xs:QName("Classification"), "Finding"), cts:element-value-query(xs:QName("Classification"), "Observation") )) )) ) for $each in $tuples return fn:concat($each[1], |, $each[2], |, $each[3], cts:frequency($each)) Best, Rob Rob Szkutak Senior Consultant MarkLogic Corporation rob.szku...@marklogic.com www.marklogic.com<http://www.marklogic.com> ________________________________ From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Mark Shanks [markshanks...@hotmail.com] Sent: Friday, September 16, 2016 3:55 PM To: 'General MarkLogic Developer Discussion' Subject: [MarkLogic Dev General] Speeding up xquery returning aggregates Hi, I'm trying to find the best way to return the results of what would be the following equivalent sql statement: select count(*) from Data where Audit_Date > "2010-01-01" and Audit_Date < "2011-01-01" and (Classification = "Finding" or Classification = "Observation") group by Site, Department, LOB I didn't test this sql statement, but it should give you the idea... Anyway, I came up with the following xquery equivalent: for $s in distinct-values(/Data/Site) return for $d in distinct-values(/Data/Department) return for $lob in distinct-values(/Data/LOB) return concat($s,'|',$d,'|',$lob,'|', count( for $x in (/Data[Site=$s and Department=$d and LOB=$lob and (Classification='Finding' or Classification='Observation')]) let $date as xs:dateTime := $x/Audit_Date where $date gt xs:dateTime("2010-01-01T00:00:00") and $date lt xs:dateTime("2011-01-01T00:00:00") return ($x) ) ) It works fine and is not super-slow, but isn't particularly fast either. Is this the most efficient way to get this type of information out of marklogic? Assuming the fields are indexed, would some search command be faster? Or maybe subset the data better? Thanks, Mark
_______________________________________________ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general