[ 
https://issues.apache.org/jira/browse/ATLAS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

VINAYAK MARRAIYA updated ATLAS-5238:
------------------------------------
    Description: 
Lineage events generated by *Apache Impala* currently do not include explicit 
information about the *operation type* of the executed query.

Example lineage event produced by Impala:

 
{code:java}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
 
h3. What Impala Provides
Impala emits lineage events that include information such as:
 * {{queryText}}

 * {{queryId}}

 * execution timestamps

 * lineage graph ({{edges}} and {{vertices}})

However, the event *does not include the operation type* (e.g., {{CREATE}}, 
{{INSERT}}, {{SELECT}}, {{ALTER}}).
h3. What Atlas Currently Needs to Do

When processing lineage events, *Apache Atlas* requires the *operation type* to 
correctly interpret the query and construct lineage relationships.

Since Impala does not provide this information, the Atlas Impala integration 
attempts to *derive the operation type from the {{queryText}}*. This is 
implemented in the Atlas hook ({{ImpalaLineageHook}}) using regex-based 
parsing logic in {{ImpalaOperationParser}}.

This approach is *not fully reliable*, because certain SQL constructs can 
defeat the regex matching. For example:
 * SQL statements containing *single-line comments*

 * variations in SQL formatting

 * complex query structures

These cases may lead to incorrect or missing operation type detection.
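
To illustrate the failure mode, here is a minimal sketch of keyword detection anchored at the start of the query text. It is not the actual {{ImpalaOperationParser}} code: the class name, method names, and patterns are hypothetical and only show why a leading single-line comment can hide the operation keyword.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the actual ImpalaOperationParser implementation.
final class NaiveOperationSketch {
    // Anchored at the start of the input: the first token must be a known keyword.
    private static final Pattern LEADING_KEYWORD = Pattern.compile(
            "^\\s*(create|insert|select|alter)\\b", Pattern.CASE_INSENSITIVE);

    // Matches "-- ..." up to the end of the line. Deliberately simplistic:
    // it would also remove a "--" that appears inside a string literal.
    private static final Pattern LINE_COMMENT = Pattern.compile("--[^\\n]*");

    // Returns the upper-cased leading keyword, or empty if none is found.
    static Optional<String> detect(String queryText) {
        Matcher m = LEADING_KEYWORD.matcher(queryText);
        return m.find()
                ? Optional.of(m.group(1).toUpperCase(Locale.ROOT))
                : Optional.empty();
    }

    // Same detection, applied after removing single-line comments.
    static Optional<String> detectAfterStrippingComments(String queryText) {
        return detect(LINE_COMMENT.matcher(queryText).replaceAll(""));
    }
}
```

With this sketch, {{detect}} finds the keyword in the plain {{create table}} statement from the example event, but returns an empty result when the same statement is preceded by a {{--}} comment line. Stripping comments first recovers the keyword, at the cost of one more text heuristic that can itself misfire.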
h3. Possible Improvements

One option is to ensure that the {{queryText}} included in lineage events is 
always a *valid SQL statement* (see IMPALA-14741). However, Atlas would still 
need to infer the operation type from the query text.

A more robust approach would be for *Apache Impala* to include an *explicit 
operation type field* in the lineage event payload. If this information is 
provided directly, *Apache Atlas* can consume it without relying on fragile 
regex-based parsing of the SQL text, improving the reliability of lineage 
ingestion.
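
A rough sketch of how a consumer might prefer an explicit field and fall back to text-based inference only when the field is absent. The field name {{operationType}}, the class name, and the map-based event representation are assumptions for illustration; the current Impala payload has no such field, and the real hook may model events differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: "operationType" is an assumed field name that is not
// part of the current Impala lineage payload.
final class OperationTypeResolverSketch {
    static final String OPERATION_TYPE_FIELD = "operationType"; // assumption

    // Prefers the explicit field; otherwise delegates to a caller-supplied
    // fallback that infers the type from queryText (for example, regex parsing).
    static Optional<String> resolve(Map<String, Object> event,
                                    Function<String, Optional<String>> textFallback) {
        Object explicit = event.get(OPERATION_TYPE_FIELD);
        if (explicit instanceof String && !((String) explicit).trim().isEmpty()) {
            return Optional.of(((String) explicit).trim().toUpperCase(Locale.ROOT));
        }
        Object queryText = event.get("queryText");
        return queryText instanceof String
                ? textFallback.apply((String) queryText)
                : Optional.empty();
    }
}
```

Keeping the fallback as a parameter would let events from Impala versions that do not emit the field continue to be processed by the existing text-based inference.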

  was:
Lineage events generated by Impala currently do not include explicit 
information about the *operation type* of the executed query.

 
Example lineage event:
{code:java}
{
  "queryText": "create table test_db_01.test_tbl_01 (id int)",
  "queryId": "b44da06a10682ce9:286bd74300000000",
  "hash": "7debad31b299d7cccdf78a67968eb39d",
  "user": "[email protected]",
  "timestamp": 1771622004,
  "endTime": 1771622005,
  "edges": [],
  "vertices": []
}
{code}
When ingesting Impala lineage events, *Apache Atlas* requires the *operation 
type* (e.g., {{CREATE}}, {{INSERT}}, {{SELECT}}, {{ALTER}}) to 
correctly interpret the query and construct the appropriate lineage 
relationships.

Since this information is not present in the lineage event, the Atlas Impala 
integration currently attempts to *derive the operation type from the query 
text*. This is implemented in the Atlas hook ({{ImpalaLineageHook}}) 
using regular expression parsing logic in {{ImpalaOperationParser}}.

However, this regex-based approach is not fully reliable and can fail in 
certain cases. For example, SQL statements that contain *single-line comments 
or other formatting variations* may prevent the parser from correctly 
identifying the operation type.

One possible improvement is to ensure that the {{queryText}} included in the 
lineage event is always a valid SQL statement (see IMPALA-14741). However, this 
still requires Atlas to infer the operation type from the query text.
h3. Proposed Improvement

To improve reliability for downstream lineage consumers such as *Apache 
Atlas*, Impala could include an *explicit operation type field* in the 
lineage event payload. Providing this information directly would remove the 
need for regex-based parsing in Atlas and ensure more accurate lineage 
processing.

Once this information is available in the lineage event, the Atlas Impala hook 
can be updated to *consume the provided operation type instead of deriving it 
from the SQL text*.


> Add operation type to the lineage graph
> ---------------------------------------
>
>                 Key: ATLAS-5238
>                 URL: https://issues.apache.org/jira/browse/ATLAS-5238
>             Project: Atlas
>          Issue Type: Task
>          Components:  atlas-core
>    Affects Versions: 3.0.0
>            Reporter: VINAYAK MARRAIYA
>            Assignee: VINAYAK MARRAIYA
>            Priority: Major
>
> Lineage events generated by *Apache Impala* currently do not include explicit 
> information about the *operation type* of the executed query.
> Example lineage event produced by Impala:
>  
> {code:java}
> {
>   "queryText": "create table test_db_01.test_tbl_01 (id int)",
>   "queryId": "b44da06a10682ce9:286bd74300000000",
>   "hash": "7debad31b299d7cccdf78a67968eb39d",
>   "user": "[email protected]",
>   "timestamp": 1771622004,
>   "endTime": 1771622005,
>   "edges": [],
>   "vertices": []
> }
> {code}
>  
> h3. What Impala Provides
> Impala emits lineage events that include information such as:
>  * {{queryText}}
>  * {{queryId}}
>  * execution timestamps
>  * lineage graph ({{edges}} and {{vertices}})
> However, the event *does not include the operation type* (e.g., 
> {{CREATE}}, {{INSERT}}, {{SELECT}}, {{ALTER}}).
> h3. What Atlas Currently Needs to Do
> When processing lineage events, *Apache Atlas* requires the *operation type* 
> to correctly interpret the query and construct lineage relationships.
> Since Impala does not provide this information, the Atlas Impala integration 
> attempts to *derive the operation type from the {{queryText}}*. This is 
> implemented in the Atlas hook ({{ImpalaLineageHook}}) using regex-based 
> parsing logic in {{ImpalaOperationParser}}.
> This approach is *not fully reliable*, because certain SQL constructs can 
> defeat the regex matching. For example:
>  * SQL statements containing *single-line comments*
>  * variations in SQL formatting
>  * complex query structures
> These cases may lead to incorrect or missing operation type detection.
> h3. Possible Improvements
> One option is to ensure that the {{queryText}} included in lineage events is 
> always a *valid SQL statement* (see IMPALA-14741). However, Atlas would still 
> need to infer the operation type from the query text.
> A more robust approach would be for *Apache Impala* to include an *explicit 
> operation type field* in the lineage event payload. If this information is 
> provided directly, *Apache Atlas* can consume it without relying on fragile 
> regex-based parsing of the SQL text, improving the reliability of lineage 
> ingestion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
