Hi Mark,

Yes, the counter restarts, but the counter is normally prefixed with the CSV
filename, so the generated URIs should still differ between files. Engineering
also gave this comment:

“The limit is int max val. The upper bound applies to each split. Here's the
composition of a generated URI in mlcp local mode:
<file path>-<split-start>-<id>.<xml or json>”
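
To make the pattern concrete, here is a minimal Python sketch of that URI
composition. It is an illustration only, not mlcp's actual code: the function
names, the file paths, and the tiny row count are all hypothetical. It suggests
that a counter restarting per file is harmless as long as the file path stays in
the URI, and that the ids would collide if the path were stripped (a
hypothetical scenario, not something confirmed in this thread).

```python
# Illustrative model of <file path>-<split-start>-<id>.<xml or json>.
# Not mlcp's implementation; names and paths are made up for the example.

def generated_uri(file_path, split_start, doc_id, ext="xml"):
    """Compose a URI following the pattern quoted from engineering."""
    return f"{file_path}-{split_start}-{doc_id}.{ext}"

def uris_for_file(file_path, rows):
    """One split starting at offset 0; the id counter restarts at 1 per file."""
    return [generated_uri(file_path, 0, i) for i in range(1, rows + 1)]

files = ["/data/a.csv", "/data/b.csv"]  # hypothetical input files
all_uris = [u for f in files for u in uris_for_file(f, rows=3)]

print(all_uris[0])         # /data/a.csv-0-1.xml
print(len(set(all_uris)))  # 6: the file path keeps restarted counters apart

# Hypothetical: drop the file path so only "-<split-start>-<id>.xml" remains.
stripped = [u[u.index(".csv") + len(".csv"):] for u in all_uris]
print(len(set(stripped)))  # 3: the same ids from different files now collide
```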

Can you provide the command line that you used?

Kind regards,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, September 29, 2015 at 7:20 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of
documents

Thanks for looking into this. I got the same result twice: I tried to ingest
around 7 million records, and exactly 1 million ended up in the database.
Running other datasets without the -generate_uri parameter did not lead to this
problem. In the database with 1 million documents, the database status also
showed a lot of deleted fragments.

One possibly major difference between what I did and your test is that I had 7
files with 1 million rows each, and I pointed mlcp at the folder to ingest all
of the files. Maybe the -generate_uri counter is restarted each time mlcp
starts operating on a new file?

________________________________
From: [email protected]
To: [email protected]
Date: Tue, 29 Sep 2015 22:46:35 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of
documents

Hi Mark,

Engineering had a quick look and couldn’t find anything that looks like a
limitation. I also ran the following test.

I created a CSV with 10 million records using this XQuery code in QC (Query
Console):

(: Write a one-column CSV: header "test", then 1 to 10000000, one per line. :)
xdmp:save("/tmp/test.csv", text {
  string-join(("test", for $i in 1 to 10000000 return string($i), "&#10;"), "&#10;")
})

Which I then ingested into some empty database using:


mlcp import -input_file_path /tmp/test.csv -input_file_type delimited_text \
  -generate_uri -output_collections test \
  -output_uri_replace ".*/test.csv-,\'/test-\'" -output_uri_suffix .xml
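
For readers unfamiliar with -output_uri_replace, the following Python sketch
mimics the effect of the pattern/replacement pair in the command above. mlcp is
a Java tool, so Python's re module is only a stand-in here, and the
pre-replacement URI /tmp/test.csv-0-1000040.xml is an assumption based on the
input path and the .xml suffix.

```python
import re

# Pattern and replacement taken from the -output_uri_replace option above.
pattern = r".*/test.csv-"
replacement = "/test-"

# Assumed URI before replacement: file path, split start, id, plus .xml suffix.
raw_uri = "/tmp/test.csv-0-1000040.xml"

rewritten = re.sub(pattern, replacement, raw_uri, count=1)
print(rewritten)  # /test-0-1000040.xml
```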


I stopped it after it went past 1 million documents, but the following
copy-paste from QC Explore shows what happens:

/test-0-1000040.xml     [element]  root (no properties) test
/test-0-117608.xml      [element]  root (no properties) test
/test-0-120735.xml      [element]  root (no properties) test
/test-0-136891.xml      [element]  root (no properties) test
/test-0-154749.xml      [element]  root (no properties) test
/test-0-167917.xml      [element]  root (no properties) test
/test-0-227321.xml      [element]  root (no properties) test
/test-0-238699.xml      [element]  root (no properties) test
/test-0-24671.xml


In other words, counting starts at test-0-1, and more digits are appended as
needed. There is no upper bound, other than perhaps max int.
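
A short Python check of these two points, under stated assumptions: the Explore
listing appears to be sorted as strings rather than numerically, which would
explain why the 7-digit id shows up first, and "max int" is assumed here to
mean a signed 32-bit integer as in Java.

```python
# URIs copied from the Explore listing above.
listed = [
    "/test-0-1000040.xml",
    "/test-0-117608.xml",
    "/test-0-120735.xml",
    "/test-0-136891.xml",
    "/test-0-154749.xml",
    "/test-0-167917.xml",
    "/test-0-227321.xml",
    "/test-0-238699.xml",
    "/test-0-24671.xml",
]

# String (lexicographic) sorting reproduces the order shown in the listing.
print(sorted(listed) == listed)  # True

# Sorting by the numeric id instead puts the 7-digit id last.
by_id = sorted(listed, key=lambda u: int(u.rsplit("-", 1)[1].split(".")[0]))
print(by_id[0], by_id[-1])  # /test-0-24671.xml /test-0-1000040.xml

# Assumed meaning of "max int": a signed 32-bit integer.
INT_MAX = 2**31 - 1
print(INT_MAX)              # 2147483647
print(7_000_000 < INT_MAX)  # True
```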

Kind regards,
Geert

From: <[email protected]> on behalf of Geert Josten <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, September 29, 2015 at 10:52 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of
documents

Hi Mark,

I’m not entirely sure what algorithm is used underneath, but this is obviously 
an issue. I’ll file a bug for this.

As a workaround you might use a transform, and override the URI with something
like map:put($content, "uri", concat("/mypath/", xdmp:random(), ".xml")).
By default, xdmp:random() generates a random number of up to 64 bits, which
should be big enough to go well over 1 million. If you are paranoid you could
use xdmp:exists to check whether a doc with that URI already exists.
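
The idea behind this workaround can be sketched in Python (the real thing would
be an XQuery transform running on the server). Everything here is illustrative:
`existing` is a stand-in for the database, `random_uri` is a hypothetical
helper, and secrets.randbits(64) plays the role of xdmp:random(). The last
lines estimate the chance of any collision among 7 million random 64-bit ids
with the usual birthday approximation, which is reasonable while the result is
small.

```python
import secrets

def random_uri(existing, prefix="/mypath/", suffix=".xml"):
    """Draw 64-bit random ids until one is unused, then reserve and return it."""
    while True:
        uri = f"{prefix}{secrets.randbits(64)}{suffix}"
        if uri not in existing:  # the "paranoid" existence check
            existing.add(uri)
            return uri

existing = set()  # stand-in for the URIs already in the database
uris = [random_uri(existing) for _ in range(1000)]
print(len(set(uris)))  # 1000

# Birthday approximation: P(collision) ~ n^2 / (2 * N) for n ids from N values.
n, N = 7_000_000, 2**64
p_collision = n * n / (2 * N)
print(f"{p_collision:.1e}")  # on the order of 1e-06
```

With odds that small, the existence check is mostly a safety net rather than
something expected to trigger.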

If you are not yet using transforms, that might add a bit of extra overhead.

Kind regards,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Monday, September 28, 2015 at 8:58 PM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of
documents

Hi,

I'm using the -generate_uri switch with the MarkLogic Content Pump because the
documents I have don't contain any unique IDs. However, I've found a big
problem: if I use mlcp with more than a million documents, the generated URIs
are no longer unique and documents are overwritten, leading to a maximum of
1 million documents that you can ingest in this way.

The problem is easy to see. -generate_uri creates a URI like -0-308950, varying
the last 6 digits, so there are a maximum of a million combinations.
-generate_uri doesn't seem to change the -0 or be smart enough to increase the
number of digits when the maximum is hit; it just starts to overwrite existing
documents.

This seems to be a very flawed approach and an unworkable solution. Am I
missing something? How does one generate over 1 million random unique URIs
using mlcp?

Thanks.

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general