Okay, well that explains it. The limited documentation states that the -generate_uri parameter will "automatically generate document URIs". I took this to mean it would function like an auto-increment key in SQL, whereby each document would be given a unique ID, preventing documents from being overwritten because of identical URIs. I was erasing the file path because it is ugly and has nothing to do with the original data source; I did not realize that the generated URI depends on a unique file path in order to be unique.

I suspect it is common for people to ingest from temp filenames set up by scripts, such as "extract.csv", so if I understand correctly, mlcp would overwrite the documents each time a new ingestion is run. This is not ideal, but it can be worked around. What is definitely needed is much more detail in the documentation about the -generate_uri parameter, including how it determines the URI and what is required to avoid documents being overwritten when using it.
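[Editor's note: a minimal Python sketch of the failure mode described above, assuming the URI composition quoted by engineering later in this thread (<file path>-<split-start>-<id>.<xml or json>). The function name and paths are illustrative, not mlcp internals.]

```python
# Hypothetical model of mlcp's generated URIs: <file path>-<split-start>-<id>.<suffix>
def generate_uris(file_path, split_start, n_records, suffix="xml"):
    """One URI per record, mimicking the composition reported in this thread."""
    return [f"{file_path}-{split_start}-{i}.{suffix}" for i in range(1, n_records + 1)]

# Two separate ingestion runs that both read a temp file named extract.csv:
run1 = generate_uris("/tmp/extract.csv", 0, 3)
run2 = generate_uris("/tmp/extract.csv", 0, 3)

# Same file path and same counters mean identical URIs, so the second run
# silently overwrites the documents loaded by the first.
print(run1 == run2)  # True: every URI collides
```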
From: [email protected]
To: [email protected]
Date: Wed, 30 Sep 2015 03:45:44 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi Mark,

Yes, the counter restarts, but the counter is normally prepended with the csv filename. Engineering also gave this comment:

"The limit is int max val. The upper bound applies to each split. Here's the composition of a generated URI in mlcp local mode: <file path>-<split-start>-<id>.<xml or json>"

Can you provide the command line that you used?

Kind regards,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, September 29, 2015 at 7:20 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Thanks for looking into this. I got the same result twice (trying to ingest around 7 million records and ending up with exactly 1 million in the database). Running other datasets without the -generate_uri parameter did not lead to this problem. In the database with 1 million documents, the database status also showed a lot of deleted fragments.

One possibly major difference between what I did and your test is that I had 7 files with 1 million rows each, and pointed mlcp at the folder to ingest all of the files. Maybe the generate_uri counter is restarted each time mlcp starts operating on a new file?

From: [email protected]
To: [email protected]
Date: Tue, 29 Sep 2015 22:46:35 +0000
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi Mark,

Engineering had a quick look, and couldn't find anything that looks like a limitation. I also ran the following test.
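[Editor's note: a short Python sketch of the interaction just described, under the same assumed URI composition quoted above. It shows why the per-file counter restart is harmless while the file path stays in the URI, and why stripping the path (as with -output_uri_replace) makes a multi-file folder ingest collide. Names are illustrative.]

```python
# Hypothetical model: the generated-URI counter restarts at 1 for each input
# file, but each file's path normally prefixes its URIs.
def uris_for_folder(files, n_records, strip_path=False):
    uris = []
    for path in files:
        prefix = "" if strip_path else path
        uris.extend(f"{prefix}-0-{i}.xml" for i in range(1, n_records + 1))
    return uris

files = ["/data/a.csv", "/data/b.csv"]
print(len(set(uris_for_folder(files, 5))))                   # 10: all unique
print(len(set(uris_for_folder(files, 5, strip_path=True))))  # 5: later files overwrite earlier ones
```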
I created a csv with 10 mln records with this XQuery code in QC:

  xdmp:save("/tmp/test.csv", text{string-join(("test", for $i in 1 to 10000000 return string($i), " "), " ")})

Which I then ingested into an empty database using:

  mlcp import -input_file_path /tmp/test.csv -input_file_type delimited_text -generate_uri -output_collections test -output_uri_replace ".*/test.csv-,\'/test-\'" -output_uri_suffix .xml

I stopped it after it went past 1 mln, but the following copy-paste from QC Explore shows what happens (all documents are in collection test):

  /test-0-1000040.xml
  /test-0-117608.xml
  /test-0-120735.xml
  /test-0-136891.xml
  /test-0-154749.xml
  /test-0-167917.xml
  /test-0-227321.xml
  /test-0-238699.xml
  /test-0-24671.xml

In other words, counting started at test-0-1, and mlcp appends more digits as needed. There is no upper bound, other than perhaps max int.

Kind regards,
Geert

From: <[email protected]> on behalf of Geert Josten <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, September 29, 2015 at 10:52 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi Mark,

I'm not entirely sure what algorithm is used underneath, but this is obviously an issue. I'll file a bug for it.

As a work-around you might use a transform, and override the uri with something like:

  map:put($content, "uri", concat("/mypath/", xdmp:random(), ".xml"))

xdmp:random() by default generates a 64-bit random number, which should be big enough to go well over 1 mln. If you are paranoid, you could use xdmp:exists to check whether a doc with that id already exists. If you are not yet using transforms, this might add a bit of extra overhead.
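[Editor's note: a rough Python analogue of the suggested transform, to illustrate why a random 64-bit URI makes collisions vanishingly unlikely. The helper name and the /mypath/ prefix are illustrative, mirroring the XQuery snippet above rather than any mlcp API.]

```python
import random

def random_uri(prefix="/mypath/", suffix=".xml"):
    """Build a URI from a random 64-bit integer, like xdmp:random() would."""
    return f"{prefix}{random.getrandbits(64)}{suffix}"

# Birthday bound: among n documents drawn from 2^64 values, the chance of any
# collision is roughly n^2 / 2^65 -- about 3 in a million for n = 10 million.
uris = {random_uri() for _ in range(100_000)}
print(len(uris))  # 100000 with overwhelming probability (no collisions)
```

The paranoid existence check Geert mentions corresponds to testing membership before use; with 64 bits of randomness it is almost never needed, but it is cheap insurance for very large loads.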
Kind regards,
Geert

From: <[email protected]> on behalf of Mark Shanks <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Monday, September 28, 2015 at 8:58 PM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] Using -generate_uri with mlcp and lots of documents

Hi,

I'm using the -generate_uri switch with the MarkLogic Content Pump because the documents I have don't contain any unique IDs. However, I've found a big problem: if I use mlcp with more than a million documents, the URIs that are generated are no longer unique and the documents are overwritten, leading to a maximum of 1 million documents that you can ingest in this way.

The problem is easy to see. -generate_uri creates a URI like -0-308950, varying the last six digits, so there are a maximum of a million combinations. -generate_uri doesn't seem to change the -0, or be smart enough to increase the number of digits when the maximum is hit; it just starts to overwrite existing documents. This seems to be a very flawed approach and an unworkable solution. Am I missing something? How does one generate over 1 million random unique URIs using mlcp?

Thanks.

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
