varaprasadregani opened a new issue, #58514:
URL: https://github.com/apache/airflow/issues/58514

   ### What do you see as an issue?
   
   **Docs link:** [Masking Sensitive 
Data](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/mask-sensitive-values.html#masking-sensitive-data)
   
   **Source code link:** 
[secrets_masker.py](https://github.com/apache/airflow/blob/main/airflow/shared/secrets_masker/src/airflow_shared/secrets_masker/secrets_masker.py)
   
   The documentation currently states:
   > “Airflow will by default mask Connection passwords and sensitive Variables 
and keys from a Connection’s extra (JSON) field when they appear in Task logs, 
in the Variable and in the Rendered fields views of the UI.”
   
   This statement is misleading because it implies that **all** keys in a 
Connection’s extra JSON field are masked. However, only keys whose names 
contain known sensitive keywords are actually redacted.
   
   The complete list of sensitive keywords from the source code is:
   `access_token`, `api_key`, `apikey`, `authorization`, `passphrase`, 
`passwd`, `password`, `private_key`, `secret`, `token`, `keyfile_dict`, 
`service_account`
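
The name check described above can be sketched in a few lines. This is a minimal illustration, not Airflow's implementation: the helper name `matching_keywords` is hypothetical, the keyword set is copied from the list in this issue, and case-insensitive substring matching is an assumption about how the comparison is done.

```python
# Keyword list copied from this issue; treat it as an assumption about the defaults.
SENSITIVE_KEYWORDS = frozenset({
    "access_token", "api_key", "apikey", "authorization", "passphrase",
    "passwd", "password", "private_key", "secret", "token",
    "keyfile_dict", "service_account",
})


def matching_keywords(name):
    """Return the sensitive keywords contained in `name`, sorted alphabetically.

    Hypothetical helper: lower-cases the name and does a plain substring check.
    """
    lowered = name.lower()
    return sorted(kw for kw in SENSITIVE_KEYWORDS if kw in lowered)


for name in ("extra__google_cloud_platform__keyfile_dict", "param1_token", "hello"):
    # An empty list means the key name would not trigger masking in this sketch.
    print(name, matching_keywords(name))
```

Under these assumptions, `param1_token` matches because it contains `token`, while `hello` matches nothing.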
   
   **Code used to reproduce this:**
   I verified this behavior using the following DAG, which reads values from the 
`extra` field of a Connection (conn_id `bigquery_connection_id`) and from 
Airflow Variables:
   
   ```python
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   import pendulum
   from airflow.hooks.base import BaseHook
   from airflow.models import Variable
   
   # Fetch connection and extract 'extra' JSON
   conn = BaseHook.get_connection(conn_id="bigquery_connection_id")
   extra_data = conn.extra_dejson
   
   # Test specific keys from 'extra'
   keyfile_dict = extra_data.get("keyfile_dict", "not found")  # Contains 'keyfile_dict'
   param1_token = extra_data.get("param1_token", "not found")  # Contains 'token'
   hello = extra_data.get("hello", "not found")  # No sensitive keyword
   
   # Test Variables
   test_keyfile_dict = Variable.get("test_keyfile_dict")
   service_account = Variable.get("service_account")
   
   with DAG(
       dag_id="test_masking",
       start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
       schedule=None,
       catchup=False,
   ) as dag:
   
       test_masking = BashOperator(
           task_id="masking_task",
           bash_command=f"echo '{ keyfile_dict }' > { param1_token } > { hello } > { test_keyfile_dict } > { service_account }"
       )
   ```
   
   **From my testing:**
   * `extra__google_cloud_platform__keyfile_dict` (from the connection’s extra 
JSON) → **Masked** everywhere (Rendered templates, UI, logs).
   * `hello` (no sensitive keyword) → **Not masked**.
   * `Variable.get("test_keyfile_dict")` → **Masked** only in Variables UI.
   * `Variable.get("service_account")` → **Masked** in Variables UI, Rendered 
templates, and logs.
   
   The docs should clarify that not all keys in a Connection’s extra JSON are 
masked—only those containing a sensitive keyword.
   
   **Screenshots of observations:**
   Rendered Templates:
   
   <img width="754" height="170" alt="Image" src="https://github.com/user-attachments/assets/461a937c-4319-465b-a6ff-9f329ce0f9c8" />
   
   Logs:
   
   <img width="900" height="368" alt="Image" src="https://github.com/user-attachments/assets/1e9aa32c-fe9b-41ef-851d-ca8b5e941f29" />
   
   Variables UI:
   
   <img width="1304" height="318" alt="Image" src="https://github.com/user-attachments/assets/29d47709-618f-4f68-bce4-5941983deeff" />
   
   ### Solving the problem
   
   I suggest two specific updates to the documentation to fix this ambiguity 
and clarify the scope of masking:
   
   **1. Update the default masking paragraph**
   Clarify that masking is conditional on the key name.
   
   *Current Text:*
   > Airflow will by default mask Connection passwords and sensitive Variables 
and keys from a Connection’s extra (JSON) field when they appear in Task logs, 
in the Variable and in the Rendered fields views of the UI.
   
   *Proposed Text:*
   > Airflow will by default mask Connection passwords, sensitive Variables, 
and keys from a Connection’s extra (JSON) field **whose names contain one or 
more of the sensitive keywords** when they appear in Task logs, in the 
Variables UI, and in the Rendered fields views of the UI. Keys in the extra 
JSON that do not include any of these sensitive keywords will not be redacted 
automatically.
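
To illustrate the proposed wording, here is a rough sketch of how key-name-based masking could play out for values in an extra dict. `TinyMasker` is a hypothetical, heavily simplified stand-in and not Airflow's `SecretsMasker`; the keyword tuple is copied from this issue, and the example values `abc123` and `world` are made up.

```python
# Keyword list copied from this issue; treat it as an assumption about the defaults.
SENSITIVE_KEYWORDS = (
    "access_token", "api_key", "apikey", "authorization", "passphrase",
    "passwd", "password", "private_key", "secret", "token",
    "keyfile_dict", "service_account",
)


class TinyMasker:
    """Hypothetical masker: remembers values stored under sensitive key names."""

    def __init__(self):
        self._secrets = set()

    def register_extra(self, extra):
        # Only values whose key name contains a sensitive keyword are remembered.
        for key, value in extra.items():
            if value and any(kw in key.lower() for kw in SENSITIVE_KEYWORDS):
                self._secrets.add(str(value))

    def redact(self, text):
        # Replace longer secrets first so overlapping values are fully hidden.
        for secret in sorted(self._secrets, key=len, reverse=True):
            text = text.replace(secret, "***")
        return text


masker = TinyMasker()
masker.register_extra({"param1_token": "abc123", "hello": "world"})
line = masker.redact("echo abc123 > world")
print(line)  # echo *** > world
```

In this sketch the value stored under `hello` is never registered, so it passes through unredacted, which is the behavior the proposed text makes explicit.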
   
   **2. Update the "Sensitive field names" section**
   In the [Sensitive field 
names](https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/mask-sensitive-values.html#sensitive-field-names)
 section, explicitly list the default keywords and add a table illustrating how 
the source and keyword affect where the data is masked.
   
   *Suggested Addition:*
   > **Default Sensitive Keywords:**
   > `access_token`, `api_key`, `apikey`, `authorization`, `passphrase`, 
`passwd`, `password`, `private_key`, `secret`, `token`, `keyfile_dict`, 
`service_account`.
   >
   > **Examples of Masking Behavior:**
   >
   > | Source | Key / Variable Name | Matching Keyword | Masking Scope |
   > | :--- | :--- | :--- | :--- |
   > | **Connection Extra** | `google_keyfile_dict` | `keyfile_dict` | **Everywhere** (Logs, Rendered Templates, UI) |
   > | **Connection Extra** | `hello` | *None* | **Not Masked** |
   > | **Variable** | `service_account` | `service_account` | **Everywhere** (Logs, Rendered Templates, UI) |
   > | **Variable** | `test_keyfile_dict` | `keyfile_dict` | **Variables UI Only** |
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
