Unless your documents' PII is separated out into its own documents, you are going to need a custom transformation on each document, the details of which are very case-specific (fill in SS#s with '???'? remove last names? remove entire sections, or replace them with sample data?). Having worked in the medical and commerce worlds, I know that getting this right, and clearly auditable, is crucial. Also consider whether you need to maintain any document properties or metadata (properties objects including mod dates, collections, permissions, DLS data, etc.), and whether these are copied as-is or modified.
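As a rough illustration of what such a per-document transformation might look like, here is a minimal sketch in Python. Everything in it is an assumption made for the example, not something from a real schema or from MarkLogic itself: the `last-name` element name, the SSN and email patterns, and the replacement strings are all hypothetical. It masks element text only (attribute values are left untouched) and returns counts of what it changed, which is the kind of record an audit trail would need.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns and element name; adjust to your own documents.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LAST_NAME_TAG = "last-name"


def mask_document(xml_text):
    """Mask PII in one XML document.

    Returns (masked_xml, audit), where audit counts each kind of
    replacement so the run can be logged and reviewed afterwards.
    """
    root = ET.fromstring(xml_text)
    audit = {"ssn": 0, "email": 0, "last_name": 0}
    for elem in root.iter():
        # Replace the whole value of last-name elements.
        if elem.tag == LAST_NAME_TAG and elem.text:
            elem.text = "REDACTED"
            audit["last_name"] += 1
        # Scrub SSNs and email addresses from text and tail content.
        for attr in ("text", "tail"):
            value = getattr(elem, attr)
            if not value:
                continue
            value, n_ssn = SSN_RE.subn("???-??-????", value)
            value, n_email = EMAIL_RE.subn("masked@example.invalid", value)
            setattr(elem, attr, value)
            audit["ssn"] += n_ssn
            audit["email"] += n_email
    return ET.tostring(root, encoding="unicode"), audit


if __name__ == "__main__":
    doc = (
        "<patient><last-name>Smith</last-name>"
        "<note>SSN 123-45-6789, mail joe@example.com</note></patient>"
    )
    masked, audit = mask_document(doc)
    print(masked)
    print(audit)
```

The same logic would need to be rewritten as an MLCP transform or a CPF action to run on-server; the point here is only that the masking rules and the audit counts live in one small, testable function.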
That refines the question into parts:

1) Select the document subset to copy
2) Transform the document content itself (*prior* to leaving the 'trust zone')
3) Select/copy/filter the document metadata
4) Extract from the source DB
5) Possibly package for secure, reliable or easy travel to the down sites (encrypt?)
6) Copy the data

Now reverse the process on the target site.

You can do all this ad hoc, once, maybe. Getting this reliable, scriptable, auditable and never screwing up is harder.

Geert's suggestion of FlexRep seems ideal for this, as it can accomplish all of these. MLCP by itself can do quite a bit, but it may be hard to put all the pieces together.

Another way is making a temporary DB and using CPF or your own code to do all the data transformation on-server (steps 1-4), then using any number of ways to copy the data (mlcp, replication, database export/import).

Or, if you prefer offline tools (say you like xproc or xmlsh or other non-server products), you could dump the DB to local files, clean them in place, then copy them over and reverse it.

FlexRep is looking really good though ...

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell: +1 812-630-7622
www.marklogic.com

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
Sent: Tuesday, March 24, 2015 2:00 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Suggestions for data masking

Hi Joel,

I haven't dealt with this personally, but could ask around. I guess though there are numerous ways to go about this, depending on the exact needs. The two that come to mind first:

You could create a permanent solution using Flexible Replication, which builds on top of CPF:

http://docs.marklogic.com/guide/flexrep/rep_intro#id_62963

You could also use the MLCP copy feature together with an MLCP transform.
You already mentioned triggers and scheduled tasks, but MLCP will load faster, I think. CPF uses triggers underneath.

Kind regards,
Geert

On 3/24/15, 2:12 AM, "Joel Wilson Gunasekaran" <[email protected]> wrote:

>Hi,
>
>Once in a while, we refresh the dataset in lower environments with
>production data for testing purposes.
>We have a requirement to mask all PII (personally identifiable
>information) data like email id, phone number, etc. in lower
>environments like DEV, QA.
>
>We were thinking about having a one-time script that does the masking,
>which can be run when we do the data refresh.
>In addition to this, we also want an automated process that does this,
>like either a scheduled task or a trigger, to avoid any sensitive data
>being left unmasked accidentally.
>
>Can you please let me know if you have had to deal with similar cases
>and any suggestions?
>
>Thanks
>Joel
>_______________________________________________
>General mailing list
>[email protected]
>http://developer.marklogic.com/mailman/listinfo/general
