[ https://issues.apache.org/jira/browse/FLINK-34991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan van Huuksloot updated FLINK-34991:
---------------------------------------
    Description: 
Hello,

I believe we've found a bug in how the Kubernetes Operator handles Job Manager 
failures. There appears to be a race condition or an incorrect conditional: the 
operator checks for High Availability metadata instead of checking whether the 
Job Manager failed because of a class-loading problem.

*Example:*
When a SQL Flink job is deployed, the Job Managers start in a failed state.
Describing the FlinkDeployment returns the error message:
{code:java}
RestoreFailed ... HA metadata not available to restore from last state. It is 
possible that the job has finished or terminally failed, or the configmaps have 
been deleted.{code}
Upon further investigation, the actual error was a class-loading failure in the 
Job Manager. The Job Manager logged:
{code:java}
org.apache.flink.table.api.ValidationException: Could not find any factory for 
identifier 'kafka' that implements 
'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.

Available factory identifiers are:

blackhole
datagen
filesystem
print
{code}
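For context, a minimal sketch of the kind of SQL job that produces this 
exception when the Kafka connector jar is not bundled with the job; the table 
name, topic, and broker address below are hypothetical:
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaTableRepro {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // This DDL requires the Kafka table factory to be on the classpath.
        // If flink-connector-kafka is missing from the bundled jar, planning
        // fails with the ValidationException shown above.
        tEnv.executeSql(
                "CREATE TABLE orders (\n"
                        + "  id BIGINT,\n"
                        + "  ts TIMESTAMP(3)\n"
                        + ") WITH (\n"
                        + "  'connector' = 'kafka',\n"
                        + "  'topic' = 'orders',\n"
                        + "  'properties.bootstrap.servers' = 'kafka:9092',\n"
                        + "  'format' = 'json'\n"
                        + ")");

        tEnv.executeSql("SELECT * FROM orders").print();
    }
}
{code}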
There is also corresponding logging in the operator:
{code:java}
... Cannot discover a connector using option: 'connector'='kafka'
	at org.apache.flink.table.factories.FactoryUtil.enrichNoMatchingConnectorError(FactoryUtil.java:798)
	at org.apache.flink.table.factories.FactoryUtil.discoverTableFactory(FactoryUtil.java:772)
	at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:215)
	... 52 more
Caused by: org.apache.flink.table.api.ValidationException: Could not find any 
factory for identifier 'kafka' that implements 
'org.apache.flink.table.factories.DynamicTableFactory' in the classpath 
....{code}
I think the operator should surface this error in the CRD status, since the HA 
error is not the root cause.
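To make the suggestion concrete, here is a rough sketch of the kind of 
conditional I have in mind. All class, method, and field names below are 
hypothetical illustrations, not the operator's actual API:
{code:java}
import java.util.Optional;

/**
 * Hypothetical illustration only - not actual flink-kubernetes-operator code.
 * Idea: before reporting "HA metadata not available", check whether the
 * JobManager already surfaced a terminal startup failure (such as the
 * ValidationException above) and propagate that message to the CR status.
 */
public class RestoreErrorSelection {

    static String selectStatusError(Optional<String> jobManagerFailure,
                                    boolean haMetadataAvailable) {
        if (jobManagerFailure.isPresent()) {
            // Prefer the root cause reported by the JobManager,
            // e.g. a missing connector on the classpath.
            return jobManagerFailure.get();
        }
        if (!haMetadataAvailable) {
            // Only fall back to the generic HA message when nothing
            // more specific is known.
            return "HA metadata not available to restore from last state.";
        }
        return null;
    }

    public static void main(String[] args) {
        // With a JobManager-side failure available, it wins over the HA message.
        System.out.println(selectStatusError(
                Optional.of("ValidationException: Could not find any factory "
                        + "for identifier 'kafka' ..."),
                false));
    }
}
{code}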

> Flink Operator ClassPath Race Condition Bug
> -------------------------------------------
>
>                 Key: FLINK-34991
>                 URL: https://issues.apache.org/jira/browse/FLINK-34991
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.7.2
>         Environment: Standard Flink Operator with Flink Deployment.
> To recreate, just remove a critical SQL connector library from the bundled jar
>            Reporter: Ryan van Huuksloot
>            Priority: Minor
>


