Hi Fabian,

So I played around a bit more with the pipelines and I was able to launch
Dataflow jobs, but it's not completely working as expected.
The documentation around this is also a bit scattered, so I'm not sure I'll
be able to figure out the final solution in a short period of time.

Steps taken to get this working (a rough end-to-end command sketch follows
the list):
- Modified the code a bit; these changes will be merged soon [1]
- Generated a hop-fatjar.jar
- Uploaded a pipeline and the Hop metadata to Google Storage
  - Modified the run configuration to take the fat jar from the following
location: /dataflow/template/hop-fatjar.jar (the location in the Docker image)
- Modified the default Dockerfile to include the fat jar:


FROM gcr.io/dataflow-templates-base/java11-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY hop-fatjar.jar .

ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"

ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]

- Saved the image to the container registry (gcloud builds submit --tag
<image_location>:latest .)
- Created a new pipeline using the following template:

{
    "defaultEnvironment": {},
    "image": "<your image location>:latest",
    "metadata": {
        "description": "This template allows you to start Hop pipelines on dataflow",
        "name": "Template to start a hop pipeline",
        "parameters": [
            {
                "helpText": "Google storage location pointing to the pipeline you wish to start",
                "label": "Google storage location pointing to the pipeline you wish to start",
                "name": "HopPipelinePath",
                "regexes": [
                    ".*"
                ]
            },
            {
                "helpText": "Google storage location pointing to the Hop Metadata you wish to use",
                "label": "Google storage location pointing to the Hop Metadata you wish to use",
                "name": "HopMetadataPath",
                "regexes": [
                    ".*"
                ]
            },
            {
                "helpText": "Run configuration used to launch the pipeline",
                "label": "Run configuration used to launch the pipeline",
                "name": "HopRunConfigurationName",
                "regexes": [
                    ".*"
                ]
            }
        ]
    },
    "sdkInfo": {
        "language": "JAVA"
    }
}

- Filled in the parameters with the Google Storage locations and the run
configuration name
- Ran the pipeline
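
For reference, this is roughly what the whole sequence looks like from the
command line. Treat it as a sketch rather than something I have fully tested:
bucket, image and region values are placeholders, and the exact hop-conf
option for generating the fat jar may differ between Hop versions.

# 1. Generate the fat jar from your local Hop installation
sh hop-conf.sh --generate-fat-jar /tmp/hop-fatjar.jar

# 2. Upload the pipeline and the exported metadata JSON to Google Storage
gsutil cp my-pipeline.hpl gs://<your-bucket>/my-pipeline.hpl
gsutil cp hop-metadata.json gs://<your-bucket>/hop-metadata.json

# 3. Build and push the launcher image (run from the directory containing
#    the Dockerfile above and hop-fatjar.jar)
gcloud builds submit --tag <image_location>:latest .

# 4. Upload the template JSON above and launch it; the same parameters can
#    also be filled in via the console when creating a Data Pipeline
gsutil cp hop-template.json gs://<your-bucket>/templates/hop-template.json
gcloud dataflow flex-template run "hop-pipeline-`date +%Y%m%d-%H%M%S`" \
    --template-file-gcs-location "gs://<your-bucket>/templates/hop-template.json" \
    --region "<region>" \
    --parameters HopPipelinePath="gs://<your-bucket>/my-pipeline.hpl" \
    --parameters HopMetadataPath="gs://<your-bucket>/hop-metadata.json" \
    --parameters HopRunConfigurationName="<your run configuration>"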

Now we get to the point where things become a bit strange: when you follow
all these steps, you will notice that a Dataflow job gets started.
This Dataflow job will then spawn another Dataflow job that contains the
actual pipeline; the original job started via the pipeline will fail, but
the other job will run fine.
[image: image.png]
The pipeline job expects a job file to be generated in a specific location,
and it then picks up this file to execute the actual job.
This is the part where we would probably have to change our code a bit, to
save the job specification to that location instead of starting another job
via the Beam API (a rough sketch of that direction is below).
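
To illustrate the direction (a minimal sketch only, not the actual Hop code;
the class name is made up): the launcher already passes --templateLocation
and friends on the command line, so the idea would be to hand those arguments
to Beam so that the DataflowRunner stages the job specification at that
location instead of submitting a separate job itself.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LauncherAwareMain { // hypothetical name, for illustration only

  public static void main(String[] args) {
    // args are whatever java_template_launcher passes in, e.g.
    // --runner=DataflowRunner --templateLocation=gs://.../job_object ...
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... translate the Hop pipeline onto 'pipeline' here ...

    // With templateLocation set, the runner should write the job
    // specification to that GCS path for the launcher to pick up,
    // rather than directly starting a brand new job.
    pipeline.run();
  }
}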

Until we get that sorted out you will have two jobs, one of which will fail
on every run. I hope this is acceptable for now.

Cheers,
Hans

[1] https://github.com/apache/hop/pull/1644


On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <[email protected]>
wrote:

> Hi Fabian,
>
> I've been digging into this a bit and it seems we will need some code
> changes to make this work.
> As far as I can tell, you have to use one of the Docker templates Google
> provides to start a pipeline from a template.
> The issue we have is that our MainBeam class requires 3 arguments to work
> (filename/metadata/run configuration name).
> These 3 arguments need to be the first 3 arguments passed to the class; we
> have no named parameters implemented.
>
> When the template launches it calls java in the following way:
>
> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam
> --pipelineLocation=test --runner=DataflowRunner --project=xxx
> --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
> --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={
> "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257
> --region=us-central1 --serviceAccount=
> [email protected]
> --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
>
> In this case it will see the first 3 arguments and select them.
> [image: image.png]
>
> As I cannot find a way to force those 3 arguments in there, we will need
> to implement named parameters in that class. I tried a bit of a hack, but
> it did not work: I changed the Docker template to the following, and the
> Google script then throws an error:
>
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam
> gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
>
> As I think this will have great added value, I will work on this ASAP.
> When the work has been done, we can even supply the required image from our
> Docker Hub account, and you should be able to run Hop pipelines on Dataflow
> by using a simple template.
>
> My idea is to add support for the following 3 named parameters:
>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>  - HopMetadataPath -> location of the metadata file (can be Google storage)
>  - HopRunConfigurationName
>
> I'll post updates here on the progress.
>
> Cheers,
> Hans
>
> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <[email protected]> wrote:
>
>> Hi Hans,
>>
>> No, I didn't yet have another go. The hints from Matt (didn't see that
>> mail on the list?) do look quite useful in the context of Dataflow templates.
>> I'll try to see whether I can get a bit further, but if you have time to
>> have a look at it, I'd much appreciate!
>>
>> cheers
>>
>> Fabian
>>
>> On 16 Aug 2022 at 11:09, Hans Van Akelyen <[email protected]>
>> wrote:
>>
>> Hi Fabian,
>>
>> Did you get this working and are you willing to share the final results?
>> If not I will see what I can do, and we can add it to our documentation.
>>
>> Cheers,
>> Hans
>>
>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <[email protected]>
>> wrote:
>>
>>> When you run class org.apache.hop.beam.run.MainBeam you need to provide
>>> 3 arguments:
>>>
>>> 1. The filename of the pipeline to run
>>> 2. The filename which contains Hop metadata
>>> 3. The name of the pipeline run configuration to use
>>>
>>> See also for example:
>>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>>
>>> Good luck,
>>> Matt
>>>
>>>
>>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <[email protected]> wrote:
>>>
>>>> Hello Hans,
>>>>
>>>> I went through the flex-template process yesterday but the generated
>>>> template does not work. The main piece that's missing for me is how to pass
>>>> the actual pipeline that should be run. My test boiled down to:
>>>>
>>>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json \
>>>>       --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>>>       --sdk-language "JAVA" \
>>>>       --flex-template-base-image JAVA11 \
>>>>       --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>>       --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>>
>>>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>>>     --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>>>     --region "europe-west1"
>>>>
>>>> With Dockerfile:
>>>>
>>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>>>
>>>> ARG WORKDIR=/dataflow/template
>>>> RUN mkdir -p ${WORKDIR}
>>>> WORKDIR ${WORKDIR}
>>>>
>>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>>>
>>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>>>
>>>>
>>>> And "todays-directories.json":
>>>>
>>>> {
>>>>     "defaultEnvironment": {},
>>>>     "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>>>     "metadata": {
>>>>         "description": "Test templates creation with Apache Hop",
>>>>         "name": "Todays directories"
>>>>     },
>>>>     "sdkInfo": {
>>>>         "language": "JAVA"
>>>>     }
>>>> }
>>>>
>>>> Thanks for having a look at it!
>>>>
>>>> cheers
>>>>
>>>> Fabian
>>>>
>>>> On 10 Aug 2022 at 16:03, Hans Van Akelyen <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Fabian,
>>>>
>>>> You have indeed found something we have not yet documented, mainly
>>>> because we have not yet tried it out ourselves.
>>>> The main class that gets called when running Beam pipelines is
>>>> "org.apache.hop.beam.run.MainBeam".
>>>>
>>>> I was hoping the "Import as pipeline" button on a job would give you
>>>> everything you need to execute this, but it does not.
>>>> I'll take a closer look in the coming days to see what is needed to use
>>>> this functionality; it could be that we need to export the template based
>>>> on a pipeline.
>>>>
>>>> Kr,
>>>> Hans
>>>>
>>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <[email protected]> wrote:
>>>>
>>>>> Hi all!
>>>>>
>>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs
>>>>> to Dataflow.
>>>>>
>>>>> Next, I'd like to schedule a batch job
>>>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>>>> but for this I need to create a template
>>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>>>> I've searched the Hop documentation but haven't found anything on this.
>>>>> I'm guessing that flex-templates
>>>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template>
>>>>> are the way to go, due to the fat-jar, but I'm wondering what to pass
>>>>> as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>>>
>>>>> cheers
>>>>>
>>>>> Fabian
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Neo4j Chief Solutions Architect
>>> ✉ [email protected]
>>>
>>>
>>>
>>>
>>
