Thanks! This helped me prevent the errors from occurring and - as a bonus -
significantly sped up my ingestion.

I couldn't use exactly the mlcp command line you suggested, since in the
version of mlcp I'm using -input_file_type xml isn't allowed; I had to use
-input_file_type documents instead. Also, my input files don't need to be
split. However, bumping up the thread count (to 30 in my case) made the
transaction/timeout complaints go away. And now I'm ingesting 100,000
documents in 12 minutes rather than one hour. Much better!
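
For reference, the command I ended up with is shaped roughly like this
(the host, port, credentials and path are placeholders, not my real values):

    mlcp.sh import -host localhost -port 8000 \
        -username myuser -password mypassword \
        -input_file_type documents \
        -input_file_path /path/to/input \
        -thread_count 30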

Regards,

Stuart



On Fri, Sep 23, 2016 at 3:34 AM, Jain, Abhishek <
abhishek.b.j...@capgemini.com> wrote:

> Hi Stuart,
>
>
>
> MLCP comes with various options and can be used in various combinations,
> depending on file size, available memory, number of nodes, forests, etc.
>
>
>
> If you want a quick solution, you can try this mlcp command:
>
> mlcp import -host yourhost -port 8000 -username userName -password
> PASSWORD -input_file_type xml -input_file_path TempData -thread_count
> -thread_count_per_split 3 -batch_size 200 -transaction_size 20
> -max_split_size 33554432 -split_input true
>
> Change the username, input file type, etc. accordingly.
>
> It’s always good to use splits and threads when working with huge datasets.
>
> Some performance parameters you can consider while using the above mlcp
> command:
>
> 1. In the app server settings, check whether the connection timeout is
> set to 0.
>
> 2. The default split size is 32 MB; you can change -max_split_size
> (the value is in bytes; 33554432 = 32 MB) if your files are bigger.
>
> 3. Make sure the split-to-thread ratio stays around 1:2 or 1:3. For
> example, if your document size is 10 MB and your split size is 1,000,000
> bytes (about 1 MB), that gives 10/1 = 10 splits, so you should create 20
> or 30 threads for best CPU utilization (see the sketch after this list).
>
> 4. The above mlcp command does well with 150 million rows; it should work
> for you as well.
>
> 5. I assume you have a good amount of RAM, at least 4 GB.
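>
> As a quick sanity check on point 3, here is a minimal shell sketch of that
> split/thread arithmetic (the numbers are just the example values above):
>
>     # splits = file size / split size; threads = 2x or 3x the split count
>     FILE_MB=10
>     SPLIT_MB=1
>     SPLITS=$((FILE_MB / SPLIT_MB))   # 10 splits
>     THREADS=$((SPLITS * 2))          # 20 threads; use *3 for a 1:3 ratio
>     echo "splits=$SPLITS threads=$THREADS"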
>
>
>
> Thanks and Regards,
>
> Abhishek Jain
>
> Associate Consultant
>
> Capgemini India | Hyderabad
>
>
>
> From: general-boun...@developer.marklogic.com [mailto:general-bounces@
> developer.marklogic.com] On Behalf Of Stuart Myles
> Sent: Thursday, September 22, 2016 11:52 PM
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] mlcp Transaction Errors - SVC-EXTIME
> and XDMP-NOTXN
>
>
>
> When I'm loading directories of slightly fewer than 100,000 XML files into
> a large MarkLogic instance, I often get timeout and transaction errors. If
> I re-run the same directory of files which got those errors, I typically
> don't get any errors.
>
>
>
> So, I have a few questions:
>
>
>
> * Can I prevent the errors from happening in the first place - e.g., by
> tuning MarkLogic parameters or altering my use of mlcp?
>
> * If I do get errors, what is the best way to get a report on the files
> which failed, so I can retry just those ones? Is the best option for me to
> write some code to pick out the errors from the log file? And, if so, am I
> guaranteed to get all of the files reported?
>
>
>
> Some Details
>
>
>
> The command line template is
>
>
>
> mlcp.sh import -username {1} -password {2} -host localhost -port {4}
> -input_file_path {5} -output_uri_replace \"{6},'{7}'\"
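>
> For instance, with illustrative values filled in (the credentials, port,
> and paths here are made up):
>
>     mlcp.sh import -username myuser -password mypassword -host localhost \
>         -port 8010 -input_file_path /mnt/ingestion/todo \
>         -output_uri_replace "/mnt/ingestion/todo,'/ingested'"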
>
>
>
> Sometimes, the imports run just fine. However, often I get a large number
> of SVC-EXTIME errors followed by an XDMP-NOTXN error. For example:
>
>
>
> 16/09/22 17:54:03 ERROR mapreduce.ContentWriter: SVC-EXTIME: Time limit exceeded
>
> 16/09/22 17:54:03 WARN mapreduce.ContentWriter: Failed document 029ccd8ac3323658277ca28fead7a73d.0.xml in file:/mnt/ingestion/MarkLogicIngestion/smyles/todo/2014_0005.done/029ccd8ac3323658277ca28fead7a73d.0.xml
>
> 16/09/22 17:54:03 ERROR mapreduce.ContentWriter: SVC-EXTIME: Time limit exceeded
>
> 16/09/22 17:54:03 WARN mapreduce.ContentWriter: Failed document 02eb4562784255e249c4ec3ed472f9aa.1.xml in file:/mnt/ingestion/MarkLogicIngestion/smyles/todo/2014_0005.done/02eb4562784255e249c4ec3ed472f9aa.1.xml
>
> 16/09/22 17:54:04 INFO contentpump.LocalJobRunner: completed 33%
>
> 16/09/22 17:54:21 ERROR mapreduce.ContentWriter: XDMP-NOTXN: No transaction with identifier 9076269665213828952
>
>
>
> So far, I'm just rerunning the entire directory. Most of the time, it
> ingests fine on the second attempt. However, I have thousands of these
> directories to process, so I would prefer to avoid getting the errors in
> the first place. Failing that, I would like to capture the errors and just
> retry the files which failed.
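>
> For what it's worth, a minimal sketch of that log-scraping, assuming the
> "Failed document ... in file:..." lines keep the format shown above
> (mlcp-import.log stands in for wherever the mlcp output is captured):
>
>     # pull the paths of failed documents out of the mlcp output
>     grep 'Failed document' mlcp-import.log \
>         | sed 's/.* in file:/file:/' > failed-files.txt
>
> Each line of failed-files.txt is then a candidate for a retry run.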
>
>
>
> Any help much appreciated.
>
>
> Regards,
>
> Stuart
>
_______________________________________________
General mailing list
General@developer.marklogic.com
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
