bydeath commented on issue #306: URL: https://github.com/apache/flink-agents/issues/306#issuecomment-3501761953
> I think this a known bug of pemja: https://issues.apache.org/jira/browse/FLINK-38585, and has been fixed in pemja recently [alibaba/pemja#87](https://github.com/alibaba/pemja/pull/87).
>
> But because Flink-Agents is indirectly depended on pemja through pyflink, Flink-Agents must wait until Flink releases a version containing the pemja fix before this issue can be resolved.

Hi @wenjin272,

Thank you for your response and for pointing out the related JIRA issue, [FLINK-38585](https://issues.apache.org/jira/browse/FLINK-38585).

Based on my analysis and the stack trace, it appears that my issue with `flink-agents` is distinct from `FLINK-38585`, which specifically addresses problems in PyFlink's thread mode execution based on Pemja. Furthermore, I have not observed `flink-agents` explicitly configuring Pemja to use thread mode (e.g., by setting `python.execution-mode` to `thread`), suggesting the nature of the Pemja usage is fundamentally different from the one targeted by FLINK-38585.

The failure I am encountering occurs when the `flink-agents` framework attempts to initialize its own embedded Python environment. The stack trace clearly indicates that the failure happens directly within the `flink-agents` operator loading the Python interpreter via Pemja:

```java
// ... (omitted)
	at pemja.core.PythonInterpreter.<init>(PythonInterpreter.java:45) ~[flink-python-1.20.3.jar:1.20.3]
	at org.apache.flink.agents.runtime.env.EmbeddedPythonEnvironment.getInterpreter(EmbeddedPythonEnvironment.java:45) ~[flink-agents-dist-0.1.0.jar:0.1.0]
	at org.apache.flink.agents.runtime.python.utils.PythonActionExecutor.open(PythonActionExecutor.java:80) ~[flink-agents-dist-0.1.0.jar:0.1.0]
	at org.apache.flink.agents.runtime.operator.ActionExecutionOperator.initPythonActionExecutor(ActionExecutionOperator.java:504) ~[flink-agents-dist-0.1.0.jar:0.1.0]
// ... (omitted)
Caused by: java.io.IOException: Failed to execute the command: ...
/venv.tar.gz/bin/python -c from find_libpython import find_libpython;print(find_libpython())
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
ModuleNotFoundError: No module named 'encodings'
```

As you can see, the call sequence confirms that `flink-agents` is directly using Pemja's `PythonInterpreter` for environment initialization:

1. The `EmbeddedPythonEnvironment.getInterpreter()` method (source: [EmbeddedPythonEnvironment.java#L45](https://github.com/apache/flink-agents/blob/fcaabe7dbe6b04f00da9c5a3563e9599710088ce/runtime/src/main/java/org/apache/flink/agents/runtime/env/EmbeddedPythonEnvironment.java#L45)) returns the `PythonInterpreter` instance.
2. This interpreter instance is created via the logic in `PythonEnvironmentManager.createEnvironment()` (source: [PythonEnvironmentManager.java#L45-L83](https://github.com/apache/flink-agents/blob/fcaabe7dbe6b04f00da9c5a3563e9599710088ce/runtime/src/main/java/org/apache/flink/agents/runtime/env/PythonEnvironmentManager.java#L45-L83)).
3. The failure (`ModuleNotFoundError: No module named 'encodings'`) happens inside Pemja's constructor (`PythonInterpreter.<init>`) during this direct initialization call.

The core issue remains that Pemja fails to initialize the self-contained Conda environment when launched by `flink-agents` in YARN mode.

I suspect that even after a Pemja bug fix is incorporated into a new Flink release, this specific issue with `flink-agents` may not be resolved, because the problem appears tied to path resolution logic within `flink-agents`' direct usage of Pemja, and not just the execution model addressed in FLINK-38585.

Thank you!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
