[I] What to Include in metastore.json Without Database Components (hop)

via GitHub Sun, 26 Jan 2025 22:00:11 -0800


arunkumarArunachalam99 opened a new issue, #4832:
URL: https://github.com/apache/hop/issues/4832


   I'll try to write a full and cohesive answer.
   Let's start by rephrasing the original question:
   
   > I want to run a pipeline/workflow on Hop Server using the REST API 
directly, without using Hop GUI or hop-run. How do I do this?
   
   ## Intro
   ### Hop Server
   **What it is**
   Hop Server is a stateless server. Its main purpose is to serve as an 
extension to Hop GUI for running a pipeline or workflow in a remote environment.
   
   **What it isn't**
   Hop Server isn't the typical server that you would use for scheduling or 
monitoring. It does not retain state or history: this information is kept only 
in memory for a short period, and after a restart all of it is lost.
   
   By circling back to what it is, we can also discuss why it is poorly 
documented. We do **not** want it to be used in a stand-alone way; it wasn't 
made for this. We did add the endpoints to our 
[documentation](https://hop.apache.org//manual/latest/hop-server/rest-api.html) 
because people were asking for them, but honestly, it was never designed to be 
used without the GUI or Hop Run. There are better ways, e.g. using short-lived 
containers, which provide more flexibility. In combination with Apache Airflow 
([tutorial](https://hop.apache.org//manual/latest/how-to-guides/run-hop-in-apache-airflow.html)) 
you can also use webhooks to start things.
   
   Shameless plug: We (know.bi) are working on something better which we hope 
to showcase soon.
   
   That covers our disclaimer. Let's get back to the subject.
   
   ## Running a single pipeline
   
   Starting something on Hop Server falls into one of three cases:
   
   - A single pipeline
   - A single workflow
   - A workflow with other pipelines/workflows
   
   Let's discuss starting a single pipeline. Starting a pipeline on Hop Server 
takes three steps.
   
   ### registerPipeline
   The first step is to send the pipeline and all needed environment 
information to the server. As stated before, the server is stateless, so it 
knows nothing on its own and needs all of this information to create a 
successful execution.
   The XML format of the request:
   
   ```
   <pipeline_configuration>
     <pipeline>
     </pipeline>
     <pipeline_execution_configuration>
       <variables></variables>
       <parameters></parameters>
       <pass_export>N</pass_export>
       <log_level>Basic</log_level>
       <log_file>N</log_file>
       <log_filename/>
       <log_file_append>N</log_file_append>
       <create_parent_folder>N</create_parent_folder>
       <clear_log>Y</clear_log>
       <show_subcomponents>Y</show_subcomponents>
       <run_configuration>local</run_configuration>
     </pipeline_execution_configuration>
     <metastore_json>
     </metastore_json>
   </pipeline_configuration>
   ```
   Three blocks of information need to be included in this request:
   **pipeline:**
   This one is simple: it is the hpl file that you wish to execute on the server.
   **pipeline_execution_configuration:**
   This block contains an export of all Hop variables in the `<variables>` 
section, plus the parameters/variables you have defined in the Run Options dialog.
   
![image](https://github.com/user-attachments/assets/fb9ae198-06b1-45a3-aee7-c2bf3206085b)
   The variables section will also contain all variables you have defined in 
your environment. If you have defined a database username, password, and so on 
in an environment file, they get added there.
   
   Each variable looks like:
   `<variable><name>VARIABLE_NAME</name><value>VALUE</value></variable>`
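
As a minimal sketch, the `<variables>` block could be generated from a plain dictionary with Python's standard library. The helper name `build_variables` and the variable names `DB_HOST` and `DB_PORT` are hypothetical and only serve as illustration; the element layout follows the single-variable example above.

```python
import xml.etree.ElementTree as ET


def build_variables(variables: dict[str, str]) -> ET.Element:
    """Build a <variables> element with one <variable> child per entry."""
    root = ET.Element("variables")
    for name, value in variables.items():
        var = ET.SubElement(root, "variable")
        ET.SubElement(var, "name").text = name
        ET.SubElement(var, "value").text = value
    return root


# Hypothetical variable names and values, for illustration only.
variables_xml = ET.tostring(
    build_variables({"DB_HOST": "localhost", "DB_PORT": "5432"}),
    encoding="unicode",
)
print(variables_xml)
```

Using an XML library rather than string concatenation means characters such as `<` or `&` in a value get escaped automatically.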
   
   **metastore_json:**
   This is the part where it gets hard. The metastore_json is a Base64-encoded 
gzip stream.
   To get a quick preview of what's in it, you could take the example from our 
docs and paste it into [this](https://www.bugdays.com/gzip-base64) website.
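
If you would rather not paste metadata into a website, the same preview can probably be done locally. The sketch below assumes the payload is UTF-8 JSON that was gzipped and then Base64-encoded, as described above; the function name `decode_metastore_json` and the tiny sample payload are made up for the example.

```python
import base64
import gzip
import json


def decode_metastore_json(encoded: str) -> dict:
    """Base64-decode, gunzip, and parse the JSON text (UTF-8 assumed)."""
    compressed = base64.b64decode(encoded)
    return json.loads(gzip.decompress(compressed).decode("utf-8"))


# Stand-in payload built in place; replace it with the content of
# <metastore_json> from a real request.
sample = base64.b64encode(gzip.compress(b'{"rdbms": []}')).decode("ascii")
print(decode_metastore_json(sample))
```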
   
   It boils down to a JSON document containing all objects you have defined in 
the metadata perspective/metadata folder.
   
   The example below is what you get if you only have a PostgreSQL connection. 
A real payload also needs to contain your run targets and all other objects 
that are available in your metadata folder. Another note: each database type 
can have different fields (just like in the UI). Most of them are shared, but 
e.g. MSSQL Server has more fields.
   
   ```
   {
     "rdbms": [
       {
         "rdbms": {
           "POSTGRESQL": {
             "databaseName": "postgres",
             "pluginId": "POSTGRESQL",
             "indexTablespace": null,
             "dataTablespace": null,
             "accessType": 0,
             "hostname": "localhost",
             "password": "",
             "pluginName": "PostgreSQL",
             "port": "5432",
             "servername": null,
             "attributes": {
               "SUPPORTS_TIMESTAMP_DATA_TYPE": "N",
               "QUOTE_ALL_FIELDS": "N",
               "SUPPORTS_BOOLEAN_DATA_TYPE": "Y",
               "FORCE_IDENTIFIERS_TO_LOWERCASE": "N",
               "PRESERVE_RESERVED_WORD_CASE": "Y",
               "SQL_CONNECT": "",
               "FORCE_IDENTIFIERS_TO_UPPERCASE": "N",
               "PREFERRED_SCHEMA_NAME": ""
             },
             "manualUrl": "",
             "username": "postgres"
           }
         },
         "name": "pg"
       }
     ]
   }
   ```
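
Going the other way, a JSON document like the one above has to be gzipped and Base64-encoded before it goes into `<metastore_json>`. This is a rough sketch under the assumption that UTF-8 is the expected text encoding; `encode_metastore_json` is a hypothetical helper, and the metadata dictionary is heavily trimmed for readability, so it is not a payload Hop Server is known to accept as-is.

```python
import base64
import gzip
import json


def encode_metastore_json(metadata: dict) -> str:
    """Serialize to JSON, gzip the bytes, and Base64-encode the result."""
    raw = json.dumps(metadata).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


# Trimmed, illustrative metadata; a real export carries every field and
# every object from the metadata folder.
metadata = {
    "rdbms": [
        {
            "rdbms": {"POSTGRESQL": {"hostname": "localhost", "port": "5432"}},
            "name": "pg",
        }
    ]
}
encoded = encode_metastore_json(metadata)
print(encoded[:16], "...")
```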
   
   After building and sending this request to the server (POST) you will get a 
response:
   ```
   <webresult>
     <result>OK</result>
     <message>Pipeline &#39;variables&#39; was added to HopServer with id 08bdff17-0d75-43a3-b890-05783376cbb2</message>
     <id>08bdff17-0d75-43a3-b890-05783376cbb2</id>
   </webresult>
   ```
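
The id from this response is needed for the next calls, so it is worth extracting it programmatically. A small sketch with the standard library follows; `parse_webresult` is a hypothetical name, and the response text is copied from the example above.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<webresult>
  <result>OK</result>
  <message>Pipeline &#39;variables&#39; was added to HopServer with id 08bdff17-0d75-43a3-b890-05783376cbb2</message>
  <id>08bdff17-0d75-43a3-b890-05783376cbb2</id>
</webresult>"""


def parse_webresult(xml_text: str) -> tuple[str, str, str]:
    """Return (result, message, id); empty elements come back as ''."""
    root = ET.fromstring(xml_text)
    return (
        root.findtext("result", default=""),
        root.findtext("message", default=""),
        root.findtext("id", default=""),
    )


result, message, pipeline_id = parse_webresult(RESPONSE)
print(result, pipeline_id)
```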
   
   ### prepareExec
   After you get back the id, call the prepareExec endpoint with a GET request:
   `GET /hop/prepareExec/?name=variables&xml=Y&id=08bdff17-0d75-43a3-b890-05783376cbb2`
   
   response:
   ```
   <webresult>
     <result>OK</result>
     <message/>
     <id/>
   </webresult>
   ```
   
   This prepares the pipeline for execution and puts it in a "waiting" state.
   
   ### startExec
   The final step is a GET to startExec to start the actual execution:
   
   `GET /hop/startExec/?name=variables&xml=Y&id=08bdff17-0d75-43a3-b890-05783376cbb2`
   
   response
   ```
   <webresult>
     <result>OK</result>
     <message/>
     <id/>
   </webresult>
   ```
   
   You can follow the progress of the execution with the pipelineStatus endpoint.
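
Put together, the register, prepare, and start sequence could be sketched as below. Several things here are assumptions and not taken from this post: the path `/hop/registerPipeline/?xml=Y` for the POST, the base URL used in the usage note, and the names `exec_url`, `check`, and `run_pipeline`. The HTTP transport is passed in as a `send` callable so the sketch stays independent of how you authenticate against your server.

```python
from typing import Callable, Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# send(method, url, body) -> response text; supplied by the caller.
Send = Callable[[str, str, Optional[bytes]], str]


def exec_url(base_url: str, endpoint: str, name: str, object_id: str) -> str:
    """Build a URL such as /hop/prepareExec/?name=...&xml=Y&id=..."""
    query = urlencode({"name": name, "xml": "Y", "id": object_id})
    return f"{base_url}/hop/{endpoint}/?{query}"


def check(xml_text: str) -> str:
    """Raise unless <result> is OK; return the <id> text ('' if empty)."""
    root = ET.fromstring(xml_text)
    if root.findtext("result") != "OK":
        raise RuntimeError(root.findtext("message") or "Hop Server returned an error")
    return root.findtext("id") or ""


def run_pipeline(send: Send, base_url: str, name: str, request_xml: bytes) -> str:
    """Register, prepare, and start a pipeline; return the server-side id."""
    # Assumed path for the register call; the post does not spell it out.
    register_url = f"{base_url}/hop/registerPipeline/?xml=Y"
    object_id = check(send("POST", register_url, request_xml))
    check(send("GET", exec_url(base_url, "prepareExec", name, object_id), None))
    check(send("GET", exec_url(base_url, "startExec", name, object_id), None))
    return object_id
```

A real `send` could wrap `urllib.request` or any HTTP client with the credentials of your server, for example `run_pipeline(send, "http://localhost:8080", "variables", request_xml)` with a hypothetical local address. The same `exec_url` helper might also be pointed at pipelineStatus, assuming it takes the same query parameters.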
   
   ## Closing note
   
   These steps should help you use the REST API directly to start a pipeline. 
Running a single workflow is a similar process.
   Running a combination of workflows and pipelines requires more work, as it 
involves a specially crafted zip file that is sent to the server.
   
   Happy coding,
   Hans
   
   _Originally posted by @hansva in 
https://github.com/apache/hop/discussions/4634#discussioncomment-11422350_


