Hi, everyone

I want to discuss how the `HOME` environment variables of multiple executors 
are managed and specified, and I am looking forward to your suggestions.

For full details, please refer directly to the issue: 
https://github.com/apache/incubator-dolphinscheduler/issues/5035


Describe the question

The `HOME` environment variables of multiple executors are as follows:

```
export HADOOP_HOME=/opt/soft/hadoop
export HADOOP_CONF_DIR=/opt/soft/hadoop/etc/hadoop
export SPARK_HOME1=/opt/soft/spark1
export SPARK_HOME2=/opt/soft/spark2
export PYTHON_HOME=/opt/soft/python
export JAVA_HOME=/opt/soft/java
export HIVE_HOME=/opt/soft/hive
export FLINK_HOME=/opt/soft/flink
export DATAX_HOME=/opt/soft/datax/bin/datax.py

export PATH=$HADOOP_HOME/bin:$SPARK_HOME1/bin:$SPARK_HOME2/bin:$PYTHON_HOME:$JAVA_HOME/bin:$HIVE_HOME/bin:$PATH:$FLINK_HOME/bin:$DATAX_HOME:$PATH
```

But the values of some `HOME` variables are not reasonable, such as 
`PYTHON_HOME` and `DATAX_HOME`. Both point to a file rather than a directory, 
so adding `$PYTHON_HOME` and `$DATAX_HOME` to `PATH` has no effect. This is 
confusing for users.

**Python/Datax related issues**: #113 #708 #2548 #2620 #2868 #3853 #4122 #4158 
#5018 #5024
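To see why a file-valued variable breaks `PATH` lookup, here is a small self-contained sketch; the paths under `/tmp` and the `mytool` stub are throwaway illustrations, not the real DolphinScheduler layout:

```shell
# Create a fake executor install: a root directory with a bin/ subdirectory.
mkdir -p /tmp/env_demo/python/bin
printf '#!/bin/sh\necho ok\n' > /tmp/env_demo/python/bin/mytool
chmod +x /tmp/env_demo/python/bin/mytool

# Wrong: the variable points at a file, so putting it on PATH finds nothing,
# because each PATH entry must be a directory to search.
BAD_HOME=/tmp/env_demo/python/bin/mytool
PATH="$BAD_HOME:$PATH" command -v mytool || echo "mytool not found"

# Right: the variable points at the root directory; add its bin/ to PATH.
PYTHON_HOME=/tmp/env_demo/python
PATH="$PYTHON_HOME/bin:$PATH" command -v mytool   # prints the full path
```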

At the same time, multiple versions of an executor raise a scalability 
problem, as with `SPARK_HOME1` and `SPARK_HOME2`: Spark 3.0.0 was released on 
June 18, 2020 (the latest version is now 3.1.1), so this numbered naming 
scheme does not scale. Moreover, having both `$SPARK_HOME1/bin` and 
`$SPARK_HOME2/bin` in `PATH` means one version shadows the other.
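The shadowing effect is easy to reproduce with two stub installs (throwaway `/tmp` paths and stub scripts for illustration): whichever `bin` directory comes first in `PATH` wins, and the other version becomes unreachable by name.

```shell
# Two fake Spark installs, each with its own spark-submit stub.
mkdir -p /tmp/spark_demo/spark1/bin /tmp/spark_demo/spark2/bin
printf '#!/bin/sh\necho spark1\n' > /tmp/spark_demo/spark1/bin/spark-submit
printf '#!/bin/sh\necho spark2\n' > /tmp/spark_demo/spark2/bin/spark-submit
chmod +x /tmp/spark_demo/spark1/bin/spark-submit /tmp/spark_demo/spark2/bin/spark-submit

# With both on PATH, a bare "spark-submit" always resolves to the first entry;
# the second version can only be reached through its absolute path.
PATH="/tmp/spark_demo/spark1/bin:/tmp/spark_demo/spark2/bin:$PATH" spark-submit   # prints: spark1
```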

As for `PYTHON_HOME`: `python2` is usually built into the operating system, so 
`PYTHON_HOME` is mainly set for `python3`.

So, we need to think about how to manage and specify the `HOME` environment 
variables of multiple executors (possibly with multiple versions).

**What are the current deficiencies and the benefits of improvement**

Specifically, as shown below:

- The `HOME` environment variable of each executor should point to the 
executor's root directory instead of a binary path
- When executing a task command, use the absolute path of the executor binary, 
derived from its `HOME` environment variable
- There are two solutions for multiple versions of an executor:
  - Use only **ONE** `HOME` environment variable per executor, so 
`SPARK_HOME1` and `SPARK_HOME2` are merged into `SPARK_HOME`, and use 
`worker.groups` to distinguish versions. For example, set 
`SPARK_HOME=/opt/spark-1.6.3` on node1 and `SPARK_HOME=/opt/spark-2.4.7` on 
node2, then set `worker.groups=spark1` on node1 and `worker.groups=spark2` 
on node2. **The cost of this solution is small**.
  - Use **a table in the database**, such as `t_ds_executor_definition`, to 
manage multiple versions; it is managed and edited by users. The table 
schema contains `task type`, `executor name`, `executor HOME path`, etc. 
**This solution has very good scalability**.
    - This feature could also fix Anaconda problems like #4158 by adding `ANACONDA_HOME`
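The first option could be sketched as follows; the `/tmp` paths and the stub `spark-submit` script are illustrative stand-ins for a real Spark install:

```shell
# Stand-in for a real install at /opt/spark-1.6.3 on a worker in group "spark1".
mkdir -p /tmp/worker_demo/spark-1.6.3/bin
printf '#!/bin/sh\necho "Spark 1.6.3"\n' > /tmp/worker_demo/spark-1.6.3/bin/spark-submit
chmod +x /tmp/worker_demo/spark-1.6.3/bin/spark-submit

# Each worker node exports exactly one HOME for the executor...
export SPARK_HOME=/tmp/worker_demo/spark-1.6.3

# ...and tasks invoke the binary via an absolute path derived from it,
# so PATH order and other installed versions no longer matter.
"$SPARK_HOME/bin/spark-submit"   # prints: Spark 1.6.3
```

A node in group `spark2` would export a different `SPARK_HOME`, and the same task definition would run against that version unchanged.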

**Spark/Flink related issues**:  #839 #3359 #3744 #3785

**Which version of DolphinScheduler:**
- dev


Best Regards

--
DolphinScheduler(Incubator) Contributor
Shiwen Cheng 程世文
Mobile: (+86)15201523580
Email: [email protected]
