Re: [MarkLogic Dev General] How to pull data out of marklogic quickly?

Phil Barber Tue, 11 Oct 2016 07:09:07 -0700

Hello Mark,

I'm using CoRB2 to produce CSV with about 1M rows.


1) I have a typical CoRB job that selects all the URIs that I'm interested
in. The "process" part of the job transforms each record into a CSV
string and writes it to its own file, so there is one CSV file per record.
2) I have a static CSV file that contains only the header row.
3) When the CoRB job is complete, a shell script merges all those
per-record CSV files into a single monolithic CSV file. It looks like
this:

rm -f merged.csv; cat headers.csv > merged.csv; find exportData/ -name "*.csv" | xargs -i cat {} >> merged.csv; chmod 660 merged.csv
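
For what it's worth, the merge step could probably be made a bit more robust
for larger exports. The following is only a sketch, not something I have run
at scale. The function name merge_csv and its three arguments are made up for
illustration. It assumes find -print0, sort -z and xargs -0 are available
(GNU and BSD tools have them, but they are not strict POSIX), and that every
per-record file ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative only):
#   merge_csv HEADER_FILE EXPORT_DIR OUTPUT_FILE
merge_csv() {
    header=$1
    export_dir=$2
    out=$3
    tmp="$out.tmp"

    # Start from the header row. Writing to a temporary file first means a
    # failed run never leaves a half-written merged file in place.
    cat "$header" > "$tmp" || return 1

    # -print0 / -0 keep file names containing spaces intact, sort -z gives a
    # repeatable row order, and xargs passes many files to each cat call
    # instead of starting one cat process per file.
    find "$export_dir" -type f -name '*.csv' -print0 |
        sort -z |
        xargs -0 cat >> "$tmp" || return 1

    # Set the permissions, then move the finished file into place.
    chmod 660 "$tmp" && mv "$tmp" "$out"
}
```

With the layout described above it would be called as:
merge_csv headers.csv exportData merged.csv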

Keep in mind, I have not attempted to run this on a larger dataset, so
scaling may be an issue.
Phil

-- 

Philip Barber
Senior Consultant
MarkLogic Corporation

On Tue, Oct 11, 2016 at 3:15 AM, Geert Josten <[email protected]>
wrote:

> Hi Mark,
>
> The best way to tackle this would be to parallelize output. Have 10 or
> more worker threads consume parts of the total (how many might depend on
> your cluster size, and the total amount of records you need to produce),
> and make each write a CSV on its own.
>
> The cts:search is a good starting point, but if you want to emit CSV
> anyhow, then don’t wrap the results of cts:search in a <data> element.
> Instead let each doc found from cts:search return one or more line-strings,
> which you don’t join either. MarkLogic will insert line-ends between such
> strings automatically, and this way it will allow for streaming.
>
> Doing it right, one worker should be able to produce a 1-million-record CSV
> file in a few minutes on an average laptop.
>
> At this point, I would worry less about using $x//Department, but assuming
> $x holds the document node, you could write $x/Record/Department. That
> would indeed be a little quicker.
>
> Not sure if Corb(2) can produce CSV, and if it would leverage parallelism
> in the same way as I meant, but it could be worth taking a look at
> cluster-based tools like Hadoop. Apache Camel might allow parallel
> processing too.
>
> Cheers,
> Geert
>
> From: <[email protected]> on behalf of Mark Shanks <
> [email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Tuesday, October 11, 2016 at 12:27 AM
> To: MarkLogic Developer Discussion <[email protected]>
>
> Subject: Re: [MarkLogic Dev General] How to pull data out of marklogic
> quickly?
>
> MLCP isn't an option as it doesn't provide text-delimited output.
> Text-delimited is a useful format as it allows the data to be pulled into
> practically any other application and with little overhead, unlike
> xml/json. Another problem with xml/json output other than compatibility is
> the file size. When a text-delimited file can be over 30GB with the data we
> are working with, the same data in xml or json becomes absolutely gigantic.
>
> What you say about $x//Department makes sense. If the data is in Marklogic
> as:
>
> <Record>
>       <Department>Sales</Department>
> </Record>
>
> What is the best way to get the Department value (i.e., fastest)?
> ------------------------------
> *From:* [email protected] <
> [email protected]> on behalf of Sekhon, Navdeep <
> [email protected]>
> *Sent:* Tuesday, 11 October 2016 6:22:44 AM
> *To:* [email protected]
> *Subject:* Re: [MarkLogic Dev General] How to pull data out of marklogic
> quickly?
>
> Have you looked into using MLCP?
> https://developer.marklogic.com/products/mlcp
>
> You can provide your cts query as an option to mlcp, get the documents out
> of ml and do your processing.
>
> Also, this $x//Department is an expensive operation. You should instead
> give the exact xpath.
>
> Regards,
>
> ns/.
>
> -----Original Message-----
> From: [email protected] [mailto:general-bounces@
> developer.marklogic.com <[email protected]>] On
> Behalf Of [email protected]
> Sent: Monday, October 10, 2016 3:00 PM
> To: [email protected]
> Subject: General Digest, Vol 148, Issue 13
>
>
> Message: 1
> Date: Mon, 10 Oct 2016 18:43:52 +0000
> From: Mark Shanks <[email protected]>
> Subject: [MarkLogic Dev General] How to pull data out of marklogic
>         quickly?
> To: "[email protected]"
>         <[email protected]>
>
> Hi,
>
>
> We have a need to pull large amounts of data out of marklogic as quickly
> as possible. I found that doing xquery searches like query-by-example were
> very slow. Using the cts functions led to a big speed increase. However, it
> isn't clear whether my current approach is the optimum, or whether there
> are other better alternatives. Unfortunately, while there is a lot of
> documentation describing many different ways of doing things in marklogic,
> there seems to be very little documentation describing what are the best or
> most efficient approaches (e.g., what if your goal is not only to run a
> query successfully, but to maximize its performance?). At present, I'm
> using the java api to pull documents. I'm using the theCall.xquery(query)
> function in Java to run custom xquery through the rest api. The xquery is
> as follows:
>
>
> <data>{
> for $x in cts:search(fn:doc(), cts:and-query((
> cts:element-value-query(xs:QName('Department'), 'Sales'),
> cts:element-range-query(xs:QName('Date'), '>', xs:date('2015-01-01')),
> cts:element-range-query(xs:QName('Date'), '<', xs:date('2015-01-03')),
> cts:not-query(cts:element-value-query(xs:QName('Date'), 'NULL')) )),
> 'unfiltered', 0.0) return
> fn:concat($x//Department,'|',$x//Total,'|',$x//Location,'&#10;')}
> </data>
>
> There are indexes on Date and Department. The xquery wraps all of the
> documents in the <data> tags and sends the results to the java program. It
> then strips the <data> tags and prints the results to a text file.
>
> I have found that you can run multiple threads in the Java program, each
> requesting a different "chunk" of the results with a range predicate such as
> [1 to 1000000], [1000001 to 2000000], etc.
>
> This approach is much faster than our original approach - 12 hours with 8
> threads, rather than 75 hours using query-by-example. However, it is not
> clear if this is the fastest way, or there are further optimizations or
> better approaches. For instance, when pulling the actual elements from the
> documents, I found that having them indexed made no difference to
> performance. Is there a way of pulling from the indexes to improve
> performance? Is there a way to specify the elements you want in the
> cts:search that will improve performance? Is there a more efficient way to
> restrict the search range? Is there documentation describing the most
> efficient approaches to querying marklogic?
>
> Thanks.
>
>
>
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
>
>
>
