Re: [MarkLogic Dev General] Suggestions for data masking

David Lee Tue, 24 Mar 2015 02:18:07 -0700

Unless your document's PII data is separated into different documents you are 
going to need to do a custom transformation on each document - the details of 
which are very case specific (fill in SS#'s with '???', remove last names, 
remove entire sections or replace them with sample data?). Having worked in the 
medical and commerce worlds, I know that getting this right, and making it 
clearly auditable, is crucial.
Also consider whether you need to maintain any document properties or metadata 
(properties objects including mod dates, collections, permissions, DLS data, 
etc.), and whether these are copied as-is or modified.
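
To illustrate the kind of per-document transformation meant above, here is a 
minimal Python sketch. It is not MarkLogic-specific: the element name 
"last-name", the SSN pattern and the function name mask_document are 
assumptions made up for the example, and a real job would follow your own 
schema and rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: element names to drop and an SSN-like pattern to blank out.
REMOVE_TAGS = {"last-name"}
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_MASK = "???-??-????"

def mask_document(xml_text: str) -> str:
    """Return a masked copy of one XML document (element text only, not attributes)."""
    root = ET.fromstring(xml_text)
    # Drop sensitive child elements. Iterate over copies because we mutate the tree.
    # Note: removing an element also drops any text that trailed it (its tail).
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in REMOVE_TAGS:
                parent.remove(child)
    # Blank out SSN-looking strings in the remaining text and tail content.
    for elem in root.iter():
        if elem.text:
            elem.text = SSN_RE.sub(SSN_MASK, elem.text)
        if elem.tail:
            elem.tail = SSN_RE.sub(SSN_MASK, elem.tail)
    return ET.tostring(root, encoding="unicode")
```

Attribute values are left untouched in this sketch, so anything sensitive 
stored in attributes would need its own rule.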

That refines the question into parts:
1) Selecting the document subset to copy 
2) Transforming the document content itself (*prior* to leaving the 'trust 
zone')
3) Select/copy/filter the document metadata
4) Extract from the source DB 
5) -- possibly package for secure, reliable or easy travel to the down sites, 
encrypt?
6) -- Copy the data
... Now reverse the process on the target site.

You can do all this ad hoc - once, maybe.
Getting this reliable, scriptable, auditable and never screwing up -- harder.
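
One way to picture the "auditable" part is a masking run that emits one log 
record per document. This is only a sketch under assumptions: the function 
name mask_with_audit, the record fields and the in-memory dict of documents 
are all hypothetical, and the transform is whatever masking function you 
supply.

```python
import hashlib
import json
from datetime import datetime, timezone

def mask_with_audit(docs, transform, rule_names):
    """Apply transform to every document and build one JSON audit line per URI.

    docs is a dict of {uri: content}. The audit lines carry only a hash of the
    masked output, never the original content.
    """
    masked = {}
    audit_lines = []
    for uri, content in docs.items():
        result = transform(content)
        masked[uri] = result
        record = {
            "uri": uri,
            "time": datetime.now(timezone.utc).isoformat(),
            "rules": list(rule_names),
            "changed": result != content,
            "sha256_after": hashlib.sha256(result.encode("utf-8")).hexdigest(),
        }
        audit_lines.append(json.dumps(record, sort_keys=True))
    return masked, audit_lines
```

Keeping the hash of the masked output lets you check later that what arrived 
in DEV or QA is exactly what the masking run produced.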

Geert's suggestion of FlexRep seems ideal for this as it can accomplish all of 
these.

MLCP by itself can do quite a bit - but it may be hard to put all the pieces 
together.

Another way is making a temporary DB and using CPF or your own code to do all 
the data transformation on-server (steps 1-4), then using any number of ways 
to copy the data (mlcp, replication, database export/import).

Or ... if you prefer offline tools (say you like xproc or xmlsh or other 
non-server products) you could dump the DB to local files, clean them in 
place, then copy them over and reverse it.
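
For the offline route, cleaning dumped files in place could look roughly like 
the Python sketch below. It assumes the dump is a directory tree of UTF-8 
*.xml files; the email pattern, the replacement address and the function name 
clean_tree are placeholders chosen for the example, not anything prescribed 
by MarkLogic.

```python
import os
import re
import tempfile
from pathlib import Path

# Hypothetical rule: replace anything that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REPLACEMENT = "masked@example.com"

def clean_tree(root: Path) -> int:
    """Mask email addresses in every *.xml file under root; return files changed."""
    changed = 0
    # sorted() materializes the file list before any temp files are created.
    for path in sorted(root.rglob("*.xml")):
        original = path.read_text(encoding="utf-8")
        cleaned = EMAIL_RE.sub(REPLACEMENT, original)
        if cleaned == original:
            continue
        # Write to a temp file in the same directory, then swap it in, so a
        # crash cannot leave a half-written document behind.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(cleaned)
        os.replace(tmp_name, path)
        changed += 1
    return changed
```

Running it twice is harmless: the second pass finds nothing left to change 
and returns 0.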

FlexRep is looking really good though  ... 

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Geert Josten
Sent: Tuesday, March 24, 2015 2:00 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Suggestions for data masking

Hi Joel,

I haven't dealt with this personally, but could ask around. I guess though 
there are numerous ways to go about this, depending on the exact needs. 
The two that come to mind first:

You could create a permanent solution using Flexible Replication, which builds 
on top of CPF:
http://docs.marklogic.com/guide/flexrep/rep_intro#id_62963

You could also use the MLCP copy feature together with an MLCP transform.

You already mentioned triggers and scheduled tasks, but MLCP will load faster, 
I think. CPF uses triggers underneath.

Kind regards,
Geert

On 3/24/15, 2:12 AM, "Joel Wilson Gunasekaran"
<[email protected]> wrote:

>Hi,
>
>Once in a while, we refresh dataset in lower environments with 
>production data for testing purposes.
>We have a requirement to mask all PII (personally identifiable
>information) data like email id, phone number, etc. in lower 
>environments like DEV, QA.
>
>We were thinking about having a one-time script that does the masking, 
>which can be run when we do the data refresh.
>In addition to this, we also want an automated process that does this, 
>like either a scheduled task or a trigger, to avoid any sensitive data 
>left unmasked, accidentally.
>
>Can you please let me know if you have had to deal with similar cases 
>and any suggestions?
>
>Thanks
>Joel
>_______________________________________________
>General mailing list
>[email protected]
>http://developer.marklogic.com/mailman/listinfo/general

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general