Thanks for looking into this. I got the same result twice: I tried to ingest around 7 million records and ended up with exactly 1 million in the database. Running other datasets without the -generate_uri parameter did not lead to this problem. In the database with 1 million documents, the database status also showed a lot of deleted fragments. One possibly major difference between what I did and your test is that I had 7 files with 1 million rows each, and pointed mlcp at the folder to ingest all of the files. Maybe the -generate_uri counter is restarted each time mlcp starts operating on a new file?
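To make that suspicion concrete, here is a small Python sketch of the hypothesis. It is not mlcp code: the URI shape "-0-<n>.xml" is taken from the examples in this thread, while the per-file counter restart and the function names are assumptions made purely for illustration. If the counter did restart for each file and nothing in the URI distinguished the files, 7 files of N rows would yield only N distinct URIs and 6N overwrites, which would match both the document count and the deleted fragments.

```python
def generate_uris(num_files, rows_per_file, include_file_in_uri):
    """Yield one URI per row under a hypothetical per-file counter."""
    for file_idx in range(num_files):
        # Assumption: the sequence number restarts at 1 for every input file.
        for seq in range(1, rows_per_file + 1):
            # Optionally keep a per-file prefix so files cannot collide.
            prefix = f"/file{file_idx}" if include_file_in_uri else ""
            yield f"{prefix}-0-{seq}.xml"


def simulate(num_files, rows_per_file, include_file_in_uri):
    """Return (distinct_uris, overwrites) for the hypothetical scheme."""
    seen = set()
    total = 0
    for uri in generate_uris(num_files, rows_per_file, include_file_in_uri):
        seen.add(uri)
        total += 1
    # Every repeated URI would overwrite a document that already exists.
    return len(seen), total - len(seen)


if __name__ == "__main__":
    # Scaled down from 7 x 1,000,000 rows so it runs quickly.
    print(simulate(7, 1000, include_file_in_uri=False))
    print(simulate(7, 1000, include_file_in_uri=True))
```

With a file-specific part in the URI the same model produces no collisions, so whether this explains the behaviour depends on what the generated URIs actually look like before any -output_uri_replace is applied.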

From: [email protected]
To: [email protected]
Date: Tue, 29 Sep 2015 22:46:35 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi Mark,

Engineering had a quick look, and couldn't find anything that looks like a limitation. I also ran the following test. I created a CSV with 10 million records with this XQuery code in QC:

    xdmp:save("/tmp/test.csv", text{string-join(("test", for $i in 1 to 10000000 return string($i), " "), " ")})

Which I then ingested into an empty database using:

    mlcp import -input_file_path /tmp/test.csv -input_file_type delimited_text -generate_uri -output_collections test -output_uri_replace ".*/test.csv-,'/test-'" -output_uri_suffix .xml

I stopped it after it went past 1 million, but the following copy-paste from QC Explore shows what happens:

    /test-0-1000040.xml  root  (no properties)  test
    /test-0-117608.xml   root  (no properties)  test
    /test-0-120735.xml   root  (no properties)  test
    /test-0-136891.xml   root  (no properties)  test
    /test-0-154749.xml   root  (no properties)  test
    /test-0-167917.xml   root  (no properties)  test
    /test-0-227321.xml   root  (no properties)  test
    /test-0-238699.xml   root  (no properties)  test
    /test-0-24671.xml

In other words, counting starts at test-0-1 and appends more digits as needed. There is no upper bound, other than perhaps max int.

Kind regards,
Geert

From: <[email protected]> on behalf of Geert Josten <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, September 29, 2015 at 10:52 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi Mark,

I'm not entirely sure what algorithm is used underneath, but this is obviously an issue. I'll file a bug for this.

As a work-around you might use a transform, and override the URI with something like:

    map:put($content, "uri", concat("/mypath/", xdmp:random(), ".xml"))

xdmp:random() by default generates a 64-bit random number, which should be big enough to go well over 1 million. If you are paranoid you could use xdmp:exists to check whether a doc with that URI already exists. If you are not yet using transforms, that might add a bit of extra overhead.

Kind regards,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Monday, September 28, 2015 at 8:58 PM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi,

I'm using the -generate_uri switch with the MarkLogic Content Pump because the documents I have don't contain any unique IDs. However, I've found a big problem: if I use mlcp with more than a million documents, the generated URIs are no longer unique and documents are overwritten, leading to a maximum of 1 million documents that you can ingest in this way.

The problem is easy to see. -generate_uri creates a URI like -0-308950, varying the last 6 digits, so there are a maximum of a million combinations. -generate_uri doesn't seem to change the -0, or be smart enough to increase the number of digits when the maximum is hit; it just starts to overwrite existing documents. This seems to be a very flawed approach and an unworkable solution.

Am I missing something? How does one generate over 1 million random unique URIs using mlcp?

Thanks.

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
