[ https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738237#comment-17738237 ]
ASF GitHub Bot commented on PARQUET-2223:
-----------------------------------------

ggershinsky commented on PR #1112:
URL: https://github.com/apache/parquet-mr/pull/1112#issuecomment-1611882968

Per the discussion in https://issues.apache.org/jira/browse/PARQUET-2223, we need to converge on the design first. I left a number of comments in the design googledoc.

> Parquet Data Masking for Column Encryption
> ------------------------------------------
>
> Key: PARQUET-2223
> URL: https://issues.apache.org/jira/browse/PARQUET-2223
> Project: Parquet
> Issue Type: New Feature
> Reporter: Jiashen Zhang
> Priority: Major
>
> h1. Background
> h2. What is Data Masking?
> Data masking is a technique used to protect sensitive data by replacing it
> with modified or obscured values. The purpose of data masking is to ensure
> that sensitive information, such as Personally Identifiable Information
> (PII), remains hidden from unauthorized users while allowing authorized users
> to perform their tasks.
> Here are a few key points about data masking:
> * Protection of Sensitive Data: Data masking helps to safeguard sensitive
> data, such as Social Security numbers, credit card numbers, names, addresses,
> and other personally identifiable information. By applying masking
> techniques, the original values are replaced with fictional or transformed
> data that retains the format and structure but removes any identifiable
> information.
> * Controlled Access: Data masking enables controlled access to sensitive
> data. Authorized users, typically with appropriate permissions, can access
> the unmasked or original data, while unauthorized users or users without the
> necessary permissions will only see the masked data.
> * Various Masking Techniques: There are different masking techniques
> available, depending on the specific data privacy requirements and use cases.
> Some commonly used techniques include:
> ** Nullification: Replacing original data with NULL values.
> ** Randomization: Replacing sensitive data with randomly generated values.
> ** Substitution: Replacing sensitive data with fictional but realistic
> values.
> ** Hashing: Transforming sensitive data into irreversible hashed values.
> ** Redaction: Removing or masking specific parts of sensitive data while
> retaining other non-sensitive information.
> * Compliance and Data Privacy: Data masking is often employed to comply with
> data protection regulations and maintain data privacy. By masking sensitive
> data, we can reduce the risk of data breaches and unauthorized access while
> still allowing legitimate users to perform their tasks.
> * Maintaining Data Consistency: Data masking techniques aim to maintain data
> consistency and integrity by ensuring that masked data retains the original
> data's format, structure, and relationships. This allows applications and
> processes that rely on the data to continue functioning correctly.
> h2. Why do we need it?
> Data masking serves several important purposes and provides numerous
> benefits. Here are some reasons why we need data masking:
> * Data Privacy and Compliance: Data masking helps us comply with data
> privacy regulations such as the General Data Protection Regulation (GDPR) and
> the Health Insurance Portability and Accountability Act (HIPAA). These
> regulations require us to protect sensitive data and ensure that it is only
> accessible to authorized individuals. Data masking enables us to comply with
> these regulations by de-identifying sensitive data.
> * Minimize Data Exposure: By masking sensitive data, we can reduce the risk
> of data breaches and unauthorized access. If a security breach occurs, the
> exposed data will be meaningless to unauthorized users due to the masking.
> This helps protect individuals' privacy and prevents misuse of sensitive
> information.
> * Secure Testing and Development Environments: Data masking is particularly
> useful in creating secure testing and development environments. By masking
> sensitive data, we can use realistic but fictional data for testing,
> analysis, and development activities without exposing real personal or
> sensitive information.
> * Enhanced Data Sharing: Data masking allows us to share data with external
> parties, such as partners or third-party vendors, while protecting sensitive
> information. Masked data can be shared with confidence, as the original
> sensitive values are replaced with transformed or fictional data.
> * Employee Privacy: Data masking helps protect employee privacy by
> obfuscating sensitive employee information, such as social security numbers
> or salary details, in databases or HR systems. This safeguards employees'
> personal data from unauthorized access or internal misuse.
> * Insider Threat Mitigation: Data masking reduces the risk posed by insider
> threats, where authorized individuals intentionally or accidentally misuse or
> expose sensitive data. By masking data, even individuals with access to the
> data will only see masked or fictional values, limiting the potential damage
> caused by internal security breaches.
> * Flexibility and Granularity: Data masking techniques offer flexibility and
> granularity in selecting the level of masking required for different types of
> data. We can determine the appropriate masking technique based on the
> sensitivity of the data and the specific use case.
> Overall, data masking is essential for protecting sensitive data, maintaining
> compliance with regulations, mitigating data breach risks, and enabling
> secure data sharing and testing environments. It plays a crucial role in
> ensuring data privacy and maintaining the trust of individuals whose data is
> being processed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)