I’ll forward your suggestion.

Thanks,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Mark Shanks 
<[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, September 30, 2015 at 12:29 PM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Okay, well that explains it. The limited documentation states that the 
-generate_uri parameter should "automatically generate document URIs". 
I took this to mean that it would function like an index field in SQL, whereby 
each document would be given a unique ID, which would prevent documents from 
being overwritten because of identical URIs. I was erasing the file path as it 
is ugly and has nothing to do with the original data source. I did not realize 
that the generated URI depended on a unique file path to be unique. I suspect 
it would be common for people to be ingesting from temp filenames set up by 
scripts, like "extract.csv", so it seems like you are saying that mlcp would 
overwrite the documents each time a new ingestion is run.

This is not an ideal solution, but can be worked around. What is definitely 
needed is for the documentation to give a lot more detail about the 
-generate_uri parameter including how it determines the URI and what is 
required in order to avoid documents being overwritten when using it.

________________________________
From: [email protected]<mailto:[email protected]>
To: [email protected]<mailto:[email protected]>
Date: Wed, 30 Sep 2015 03:45:44 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Hi Mark,

Yes, the counter restarts, but the counter is normally prepended with the csv 
filename. Engineering also gave this comment:

“The limit is int max val. The upper bound applies to each split. Here's the 
composition of a generated URI in mlcp local mode: 
<file path>-<split-start>-<id>.<xml or json>”
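
To make the composition quoted above concrete, here is a minimal Python sketch (it is a model, not mlcp code). It builds URIs from the quoted pattern and counts how many stay unique once the file-path part is replaced with a constant. The function name generated_uri, the file names, and the row counts are made up for illustration, and the sketch assumes one split per file with the id counter restarting at 1 for each file.

```python
def generated_uri(file_path, split_start, row_id, suffix="xml"):
    """Compose a URI following the quoted pattern (illustrative only)."""
    return f"{file_path}-{split_start}-{row_id}.{suffix}"

# Hypothetical input: 7 files, 1000 rows each, one split per file starting at 0.
files = [f"/data/part{n}.csv" for n in range(1, 8)]
rows_per_file = 1000

# URIs that keep the (unique) file path as their prefix.
with_path = {
    generated_uri(path, 0, row_id)
    for path in files
    for row_id in range(1, rows_per_file + 1)
}

# Same rows after every file path has been rewritten to one constant prefix,
# which is the effect of stripping the path from the generated URI.
without_path = {
    generated_uri("/test", 0, row_id)
    for path in files
    for row_id in range(1, rows_per_file + 1)
}

print(len(with_path))     # 7000
print(len(without_path))  # 1000
```

Under these assumptions, only as many distinct URIs survive as there are rows in the largest file, and every later file overwrites the documents of the earlier ones.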

Can you provide the command-line that you used?

Kind regards,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Mark Shanks 
<[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Tuesday, September 29, 2015 at 7:20 PM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Thanks for looking into this. I got the same result twice (trying to ingest 
around 7 million records and ending up with exactly 1 million records in the 
database). Running other datasets without the -generate_uri parameter did not 
lead to this problem. In the database with 1 million documents, the database 
status also showed a lot of deleted fragments.

One possibly major difference with what I did vs your test is that I had 7 
files with 1 million rows each in them, and then pointed mlcp at the folder to 
ingest all of the files. Maybe the generate_uri counter is restarted each time 
mlcp starts operating on a new file??

________________________________
From: [email protected]<mailto:[email protected]>
To: [email protected]<mailto:[email protected]>
Date: Tue, 29 Sep 2015 22:46:35 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Hi Mark,

Engineering had a quick look, and couldn’t find anything that looks like a 
limitation. I also ran the following test.

I created a csv with 10 mln records with this XQuery code in QC:

xdmp:save("/tmp/test.csv", text{string-join(("test", for $i in 1 to 10000000 
return string($i), "&#10;"), "&#10;")})

Which I then ingested into some empty database using:


mlcp import -input_file_path /tmp/test.csv -input_file_type delimited_text \
  -generate_uri -output_collections test \
  -output_uri_replace ".*/test.csv-,\'/test-\'" -output_uri_suffix .xml


I stopped it after it went past 1 mln, but the following copy-paste from QC 
Explore shows what happens:

/test-0-1000040.xml     [element]  root (no properties) test
/test-0-117608.xml      [element]  root (no properties) test
/test-0-120735.xml      [element]  root (no properties) test
/test-0-136891.xml      [element]  root (no properties) test
/test-0-154749.xml      [element]  root (no properties) test
/test-0-167917.xml      [element]  root (no properties) test
/test-0-227321.xml      [element]  root (no properties) test
/test-0-238699.xml      [element]  root (no properties) test
/test-0-24671.xml


In other words, counting starts at test-0-1, and more digits are appended as 
needed. There is no upper bound, other than perhaps max int.
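
The order of the listing above looks consistent with plain string ordering of the URIs rather than numeric ordering, which would explain why a seven-digit counter value shows up first. A small Python sketch, using four URIs copied from the listing (how QC Explore actually orders its results is an assumption here):

```python
uris = [
    "/test-0-24671.xml",
    "/test-0-1000040.xml",
    "/test-0-238699.xml",
    "/test-0-117608.xml",
]

# Plain string comparison goes character by character, so "1000040" sorts
# before "117608" even though it is the larger number.
for uri in sorted(uris):
    print(uri)
```

The sorted output puts /test-0-1000040.xml first and /test-0-24671.xml last, matching the relative order in the listing.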

Kind regards,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Geert Josten 
<[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Tuesday, September 29, 2015 at 10:52 AM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Hi Mark,

I’m not entirely sure what algorithm is used underneath, but this is obviously 
an issue. I’ll file a bug for this.

As a work-around you might use a transform, and override the uri with something 
like map:put($content, "uri", concat("/mypath/", xdmp:random(), ".xml")). 
xdmp:random() by default generates a 64-bit random number, which should be 
big enough to go well over 1 mln. If you are paranoid you could use xdmp:exists 
to check if a doc with that id already exists.
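
As a rough check on how likely a clash between random ids is, the usual birthday approximation can be evaluated in a few lines of Python. This is only a sketch: it assumes the random values are uniformly distributed over 64 bits, and the document count of 7 million is just an example figure. The function name collision_probability is made up for illustration.

```python
import math

def collision_probability(n_docs, bits=64):
    """Approximate chance that at least two of n_docs random ids coincide."""
    space = 2 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the exponent is very close to zero.
    return -math.expm1(-n_docs * (n_docs - 1) / (2 * space))

p = collision_probability(7_000_000)
print(f"{p:.2e}")  # 1.33e-06
```

Under those assumptions the chance of even a single clash across the whole load is on the order of one in a million, so an existence check is optional rather than necessary.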

If you are not yet using transforms, that might add a bit of extra overhead.

Kind regards,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Mark Shanks 
<[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Monday, September 28, 2015 at 8:58 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of 
documents

Hi,

I'm using the -generate_uri switch with the MarkLogic Content Pump as the 
documents I have don't have any unique IDs contained within them. However, 
I've found a big problem in that, if I use mlcp with more than a million 
documents, the URIs that are generated are no longer unique and the 
documents are overwritten - leading to a maximum of 1 million documents that 
you can ingest in this way.

The problem is easy to see. -generate_uri creates a URI like -0-308950, varying 
the last 6 digits, so there are a maximum of a million combinations. 
-generate_uri doesn't seem to change the -0, or be smart enough to increase the 
number of digits when the maximum is hit; it just starts to overwrite existing 
documents.

This seems to be a very flawed approach and an unworkable solution. Am I 
missing something? How does one generate over 1 million random unique URIs 
using mlcp?

Thanks.

_______________________________________________ General mailing list 
[email protected]<mailto:[email protected]> Manage 
your subscription at: http://developer.marklogic.com/mailman/listinfo/general
