Hi Mark,

The best way to tackle this would be to parallelize the output. Have 10 or more worker threads each consume part of the total (how many might depend on your cluster size and the total number of records you need to produce), and have each write its own CSV file.
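A minimal sketch of that worker setup in plain Java, under some assumptions: `ChunkFetcher`, `export`, and the `part-NNNNN.csv` file names are hypothetical names made up for illustration, and `fetch` stands in for whatever call actually runs the query for one slice of the results. Nothing here is part of the MarkLogic Java API.

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCsvExport {

    /** Hypothetical hook: returns the delimited lines for records start..end (1-based, inclusive). */
    interface ChunkFetcher {
        Iterable<String> fetch(long start, long end) throws Exception;
    }

    /** Splits 1..total into chunks and lets each worker write its chunk to its own file. */
    static List<Path> export(ChunkFetcher fetcher, long total, long chunkSize,
                             int workers, Path outDir) throws Exception {
        Files.createDirectories(outDir);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            int part = 0;
            for (long start = 1; start <= total; start += chunkSize) {
                final long from = start;
                final long to = Math.min(start + chunkSize - 1, total);
                final Path file = outDir.resolve(String.format("part-%05d.csv", part++));
                pending.add(pool.submit(() -> {
                    // Each task owns its file, so no locking is needed between workers.
                    try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                        for (String line : fetcher.fetch(from, to)) {
                            out.write(line);
                            out.write('\n');
                        }
                    }
                    return file;
                }));
            }
            // Wait for every chunk; a failure in any worker surfaces here.
            List<Path> files = new ArrayList<>();
            for (Future<Path> f : pending) {
                files.add(f.get());
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }
}
```

The part files come back in chunk order, so they can simply be concatenated afterwards if a single file is needed.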
The cts:search is a good starting point, but if you want to emit CSV anyway, don't wrap the results of cts:search in a <data> element. Instead, let each document found by cts:search return one or more line strings, and don't join those either. MarkLogic will insert line ends between such strings automatically, and this way it allows for streaming. Done right, one worker should be able to produce a 1 million record CSV file in a few minutes on an average laptop.

At this point I would worry less about using $x//Department, but assuming $x holds the document node, you could write $x/Record/Department. That would indeed be a little quicker.

I'm not sure whether Corb(2) can produce CSV, or whether it would leverage parallelism in the way I meant, but it could be worth taking a look at cluster-based tools like Hadoop. Apache Camel might allow parallel processing too.

Cheers,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, October 11, 2016 at 12:27 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

MLCP isn't an option, as it doesn't provide text-delimited output. Text-delimited is a useful format because it allows the data to be pulled into practically any other application with little overhead, unlike XML/JSON. Another problem with XML/JSON output, besides compatibility, is the file size. When a text-delimited file can be over 30GB with the data we are working with, the same data in XML or JSON becomes absolutely gigantic.

What you say about $x//Department makes sense. If the data is in MarkLogic as:

<Record>
  <Department>Sales</Department>
</Record>

what is the best (i.e., fastest) way to get the Department value?
________________________________
From: [email protected] <[email protected]> on behalf of Sekhon, Navdeep <[email protected]>
Sent: Tuesday, 11 October 2016 6:22:44 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Have you looked into using MLCP? https://developer.marklogic.com/products/mlcp

You can provide your cts query as an option to mlcp, get the documents out of ML, and do your processing.

Also, $x//Department is an expensive operation. You should give the exact XPath instead.

Regards,
ns/.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Monday, October 10, 2016 3:00 PM
To: [email protected]
Subject: General Digest, Vol 148, Issue 13

Today's Topics:

   1. How to pull data out of marklogic quickly? (Mark Shanks)

----------------------------------------------------------------------

Message: 1
Date: Mon, 10 Oct 2016 18:43:52 +0000
From: Mark Shanks <[email protected]>
Subject: [MarkLogic Dev General] How to pull data out of marklogic quickly?
To: "[email protected]" <[email protected]>
Message-ID: <ps1pr03mb1820d1a6779e41e6fff4b16fe6...@ps1pr03mb1820.apcprd03.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

We have a need to pull large amounts of data out of MarkLogic as quickly as possible. I found that XQuery searches like query-by-example were very slow. Using the cts functions led to a big speed increase. However, it isn't clear whether my current approach is optimal, or whether there are better alternatives. Unfortunately, while there is a lot of documentation describing many different ways of doing things in MarkLogic, there seems to be very little documentation describing the best or most efficient approaches (e.g., what if your goal is not only to run a query successfully, but to maximize its performance?).

At present, I'm using the Java API to pull documents. I'm using theCall.xquery(query) in Java to run custom XQuery through the REST API. The XQuery is as follows:

    <data>{
    for $x in cts:search(fn:doc(), cts:and-query((
        cts:element-value-query(xs:QName('Department'), 'Sales'),
        cts:element-range-query(xs:QName('Date'), '>', xs:date('2015-01-01')),
        cts:element-range-query(xs:QName('Date'), '<', xs:date('2015-01-03')),
        cts:not-query(cts:element-value-query(xs:QName('Date'), 'NULL'))
      )), 'unfiltered', 0.0)
    return fn:concat($x//Department, '|', $x//Total, '|', $x//Location, '&#10;')
    }</data>

There are indexes on Date and Department. The XQuery wraps all of the results in <data> tags and sends them to the Java program, which then strips the <data> tags and prints the results to a text file. I have found that you can run multiple threads in the Java program that request different "chunks" of the data by using criteria such as [1 to 1000000], [1000001 to 2000000], etc.
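For illustration, the chunk ranges described above could be generated along these lines. This is only a sketch under assumptions: `ChunkedQueries`, `ranges`, and `chunkQuery` are hypothetical helper names, and the search expression is passed in as an opaque string rather than being taken from any particular API.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedQueries {

    /** Splits 1..total into inclusive {start, end} pairs of at most chunkSize records each. */
    static List<long[]> ranges(long total, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> out = new ArrayList<>();
        for (long start = 1; start <= total; start += chunkSize) {
            out.add(new long[] { start, Math.min(start + chunkSize - 1, total) });
        }
        return out;
    }

    /** Appends an XQuery positional predicate, e.g. [1 to 1000000], to a search expression. */
    static String chunkQuery(String searchExpr, long start, long end) {
        return "(" + searchExpr + ")[" + start + " to " + end + "]";
    }
}
```

Each thread would then take one pair from `ranges(...)` and run the string built by `chunkQuery(...)` as the sequence its `for` clause iterates over.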
This approach is much faster than our original approach: 12 hours with 8 threads, rather than 75 hours using query-by-example. However, it is not clear whether this is the fastest way, or whether there are further optimizations or better approaches. For instance, when pulling the actual elements from the documents, I found that having them indexed made no difference to performance.

Is there a way of pulling from the indexes to improve performance?
Is there a way to specify the elements you want in the cts:search that will improve performance?
Is there a more efficient way to restrict the search range?
Is there documentation describing the most efficient approaches to querying MarkLogic?

Thanks.

------------------------------

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general

End of General Digest, Vol 148, Issue 13
****************************************
