Vyshali,

You may be interested in format-preserving encryption (FPE) [1] if you need to 
maintain format while performing data masking. There are also methods to derive 
a cryptographically secure hash function from encryption [2] so that you can 
have “one way” data transformation and maintain a given format.
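
To make the "one way but same format" idea concrete, here is a minimal Python 
sketch. It is not a standardized FPE mode and not the construction from [2]; it 
is a simplified keyed-hash (HMAC) mapping I am assuming purely for illustration. 
The function name, the key, and the sample value are hypothetical.

```python
import hashlib
import hmac
import string

# Character classes that are preserved; anything else (dashes, spaces) passes through.
ALPHABETS = (string.digits, string.ascii_lowercase, string.ascii_uppercase)


def mask_preserving_format(value: str, key: bytes) -> str:
    """One-way, keyed masking that keeps length and character classes."""
    stream = b""  # bytes derived from HMAC-SHA256(key, value || counter)
    counter = 0
    used = 0
    out = []
    for ch in value:
        alphabet = next((a for a in ALPHABETS if ch in a), None)
        if alphabet is None:
            out.append(ch)  # keep separators and punctuation as-is
            continue
        if used == len(stream):
            # Extend the byte stream when the value is longer than one digest.
            block = value.encode("utf-8") + counter.to_bytes(4, "big")
            stream += hmac.new(key, block, hashlib.sha256).digest()
            counter += 1
        # Modulo introduces a slight bias; acceptable for a sketch only.
        out.append(alphabet[stream[used] % len(alphabet)])
        used += 1
    return "".join(out)


# Hypothetical key and input: digits stay digits, dashes stay in place.
masked = mask_preserving_format("555-12-3456", b"hypothetical-secret-key")
```

The same input and key always produce the same output, so joins on the masked 
field still work, but the protection rests entirely on keeping the key secret. 
Unlike real FPE, this mapping cannot be decrypted and two inputs may collide.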

I would encourage you to be aware of all attack surfaces here, though. First, 
there are many examples of anonymization being easily undone because it was not 
correctly implemented [3], used a weak process [4], or could be reconstructed 
through associated data [5]. Even with a strong anonymization approach, 
remember that NiFi tracks the data lineage throughout the process, so a user 
with sufficient permissions will be able to look at the provenance for a 
flowfile before/after it has undergone the anonymization operation and see the 
original data. This exposure can be partially mitigated by strict access 
control policies that restrict provenance access to a core group of privileged 
users. On top of that, the provenance repository does provide an encrypted 
implementation, but the content and flowfile repositories currently do not, so 
a malicious user with OS-level access could examine the repository files on 
disk to extract the original content or flowfile attributes from before they 
were anonymized. There are open Jiras [6][7] to add encryption to those 
repositories. A user could also examine a not-yet-anonymized flowfile via queue 
listing. There are open Jiras for encrypting attributes [8], hashing 
attributes [9], and “sensitive attributes” with per-key permissions [10].
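
To illustrate the "weak process" failure mode in [4]: an unsalted, unkeyed hash 
over a small input space can be reversed by hashing every possible input once. 
The sketch below assumes a hypothetical 4-digit identifier and uses SHA-1; the 
function name and values are illustrative only.

```python
import hashlib
import string
from itertools import product


def build_lookup(alphabet: str, length: int) -> dict:
    """Hash every possible input of the given length and index it by digest."""
    table = {}
    for combo in product(alphabet, repeat=length):
        candidate = "".join(combo)
        digest = hashlib.sha1(candidate.encode("ascii")).hexdigest()
        table[digest] = candidate
    return table


# Only 10,000 candidates exist for a 4-digit field, so the table is tiny.
table = build_lookup(string.digits, 4)

# The "anonymized" value as it would appear in the masked data set.
masked = hashlib.sha1(b"4821").hexdigest()

# A single dictionary lookup recovers the original value.
recovered = table[masked]
```

A secret key (HMAC) or a per-record salt that is stored separately makes this 
kind of precomputed lookup impractical, provided the secret itself stays 
protected.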

I hope this helps to illustrate the complexities of anonymization and leads you 
to a successful solution.


[1] https://en.wikipedia.org/wiki/Format-preserving_encryption
[2] https://crypto.stackexchange.com/questions/24284/is-there-a-format-preserving-cryptographically-secure-hash
[3] https://dataprivacylab.org/dataprivacy/projects/linkage/lidap-wp19.pdf
[4] https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/
[5] https://hbr.org/2015/02/theres-no-such-thing-as-anonymous-data
[6] https://issues.apache.org/jira/browse/NIFI-3834
[7] https://issues.apache.org/jira/browse/NIFI-3833
[8] https://issues.apache.org/jira/browse/NIFI-2961
[9] https://issues.apache.org/jira/browse/NIFI-1885
[10] https://issues.apache.org/jira/browse/NIFI-1140


Andy LoPresto
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 17, 2017, at 10:36 AM, Mike Thomsen <[email protected]> wrote:
> 
> Not if you use hashing. You'll get a field value like this (sha1
> algorithm): c3499c2729730a7f807efb8676a92dcb6f8a3f8f
> 
> For getting closer to the original data in the sort of values present,
> you'll need to try something like ARX.
> 
> On Tue, Oct 17, 2017 at 11:53 AM, Vyshali <[email protected]> wrote:
> 
>> Hi Chris,
>> 
>> Hashing using the ExecuteScript processor means that I should write some
>> code to do that. If so, will the format of the field remain the same?
>> 
>> Please explain with examples.
>> 
>> Regards,
>> Vyshali
>> 
>> 
>> 
>> --
>> Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/
>> 
