portikCoder opened a new issue, #31167:
URL: https://github.com/apache/beam/issues/31167

   ### What happened?
   
   Hi
   
   I'm hoping somebody out there has already figured this out. If it is
   possible at all, I'm probably just not setting up the environment correctly
   to test my whole pipeline in CI.
   
   Context: I'm using Redpanda as a Kafka replacement and trying to create
   integration tests that run on the DirectRunner, without involving any
   third-party runner. I've already managed to use the Dataflow runner from
   within the CI environment, so that's not the issue. As you know, though,
   `ReadFromKafka` needs two external dependencies from the Java SDK in order
   to function: the `beam-sdks-java-io-expansion-service-2.5X.0.jar` expansion
   service and the `apache/beam_java21_sdk:2.5X.0` Docker container.
   
   On my local machine (M1 Mac) I can set all of this up and the pipeline
   succeeds, but on CI I've run into multiple issues. I resolved most of them
   by manually downloading the expansion service jar, starting it, and passing
   its address via `ReadFromKafka(...expansion_service="localhost:8097")`.
   Based on the logs, the Python Apache Beam pipeline running on the
   DirectRunner was able to connect to it. I also pulled the Docker image
   manually, and Beam was then able to start the container. Here are the logs
   from the CI (GitLab) worker:
   
   ```
   $ netstat -an | grep 8097 || echo "Port 8097 is available"
   Port 8097 is available
   $ java -jar ${EXPANSION_FILE_NAME} 8097 &
   Starting expansion service at localhost:8097
   ...
   
   docker pull docker.io/apache/beam_java21_sdk:2.52.0
   ...
   Status: Image is up to date for apache/beam_java21_sdk:2.52.0
   ...
   
   $ netstat -an | grep 8097
   tcp        0      0 0.0.0.0:8097            0.0.0.0:*               LISTEN   
  
   $ python -m beam_streamer local --topic-name mytopicname
   ...
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:starting control server on port 45871
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:starting data server on port 33019
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:starting state server on port 36783
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:starting logging server on port 41913
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.DockerSdkWorkerHandler object at 0x7db95812c950> for environment external_1beam:env:docker:v1 (beam:env:docker:v1, b'\n\x1dapache/beam_java17_sdk:2.52.0')
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Attempting to pull image apache/beam_java17_sdk:2.52.0
   2.52.0: Pulling from apache/beam_java17_sdk
   Digest: sha256:93d5cc70211b2e6d73ce3dda909599401e6c7c6a1ebf28fd8479426509237f25
   Status: Image is up to date for apache/beam_java17_sdk:2.52.0
   docker.io/apache/beam_java17_sdk:2.52.0
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Waiting for docker to start up. Current status is running
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Docker container is running. container_id = b'e600645ec4041e08d42fc3d59e85ddd00abac0207ebee078bf9528b9658b3e79', worker_id = worker_0
   INFO:apache_beam.runners.worker.statecache:Creating state cache with size 104857600
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x7db95812dc10> for environment ref_Environment_default_environment_2 (beam:env:embedded_python:v1, b'')
   INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'2024/04/25 21:37:27 Failed to obtain provisioning information: failed to dial server at localhost:45871\n\tcaused by:\ncontext deadline exceeded\n'
   ```
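
One detail in the log above is easy to miss: the Docker environment is logged as `(beam:env:docker:v1, b'\n\x1dapache/beam_java17_sdk:2.52.0')`, so the image Beam starts is the `java17` one, not the `java21` image pulled by hand. The sketch below decodes that payload to make the image name explicit. It assumes the bytes are a protobuf message whose field 1 is a length-delimited string holding the image name, with a single-byte length; the function name `decode_docker_payload` is hypothetical and not part of Beam.

```python
def decode_docker_payload(payload: bytes) -> str:
    """Extract the image name from the environment payload seen in the log.

    Assumption: the payload is a protobuf message with one length-delimited
    field (field number 1, wire type 2, i.e. tag byte 0x0A) followed by a
    single-byte length and the UTF-8 image name.
    """
    if len(payload) < 2:
        raise ValueError("payload too short")
    tag, length = payload[0], payload[1]
    if tag != 0x0A:
        raise ValueError(f"unexpected tag byte: {tag:#x}")
    if length >= 0x80:
        raise ValueError("multi-byte length prefixes are not handled here")
    body = payload[2:2 + length]
    if len(body) != length:
        raise ValueError("payload is truncated")
    return body.decode("utf-8")


# The exact bytes from the log line above.
image = decode_docker_payload(b"\n\x1dapache/beam_java17_sdk:2.52.0")
print(image)  # apache/beam_java17_sdk:2.52.0
```

The container did start in the log, so this mismatch is probably not the cause of the timeout by itself, but it does mean the manual `docker pull` of the `java21` image has no effect on this run.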
   
   I have a feeling that the Docker container is using the wrong host. Looking
   at
   `apache_beam.runners.portability.fn_api_runner.worker_handlers.DockerSdkWorkerHandler`,
   though, there seems to be no way to set my own host name for the container,
   so I can't change it. That might explain the `context deadline exceeded`
   error.
   
   Has anybody experienced the same problem, or managed to set this up
   correctly?
   
   
   Some technical details:
   Apache Beam version: `2.52` (I had issues with 2.55, where it didn't
   download the Docker image and wasn't able to consume locally)
   CI: `Gitlab Premium`
     * I've also tried adding the `- name: docker:19-dind` service, thinking
       it might be necessary for Docker-in-Docker, but ended up not using it
       because it had the same effect as above.
   Local: `M1 Mac`
   
   Sample .py file:
   ```py
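       from apache_beam import Pipeline
       from apache_beam.io.kafka import ReadFromKafka
   
       # bootstrap_servers, groupid, topic_name, and pipeline_options are
       # assumed to be defined elsewhere in the module.
       kafka_config = {
           "bootstrap.servers": bootstrap_servers,
           "group.id": groupid,
           "auto.offset.reset": "earliest",
       }
   
       with Pipeline(options=pipeline_options) as p:
           kafka_records = p | ReadFromKafka(
               consumer_config=kafka_config,
               topics=[topic_name],
               max_read_time=10,  # stop reading after 10 seconds
               # Address of the manually started expansion service.
               expansion_service="localhost:8097",
           )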
   ```
   
   And the `.gitlab-ci.yml` job:
   ```yaml
   beam streamer integration tests:
     stage: test
     extends:
       - .install_beam_streamer_deps
     image: python:3.11.8
     variables:
       KAFKA_ADDR: 127.0.0.1:19092
       KAFKA_TOPIC: mytopicname
       EXPANSION_FILE_NAME: beam-sdks-java-io-expansion-service-2.52.0.jar
   
   #    DOCKER_HOST: docker0
   #    DOCKER_HOST: tcp://localhost:2375
   #    DOCKER_DRIVER: overlay2
   #    DOCKER_TLS_CERTDIR: ""
     services:
       - name: docker.redpanda.com/redpandadata/redpanda:v23.1.13
         command:
           - redpanda
           - start
           - --smp
           - '1'
           - --memory
           - 256M
           - --reserve-memory
           - 0M
           - --overprovisioned
           - --node-id
           - '0'
           - --kafka-addr
           - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:19092
           - --advertise-kafka-addr
           - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:19092
           - --pandaproxy-addr
           - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
           - --advertise-pandaproxy-addr
           - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
   #    - name: docker:19-dind  # It's for `apache/beam_java21_sdk:2.52.0` that will be spun up by ABeam automatically.
   #      command: ["--tls=false"]
     script:
       - curl -LO https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip
       - mkdir -p ~/.local/bin
       - export PATH="$HOME/.local/bin:$PATH"
       - unzip rpk-linux-amd64.zip -d ~/.local/bin/
       - rpk version
       - rpk --brokers ${KAFKA_ADDR} topic create ${KAFKA_TOPIC}
       - printf '%s\n' '{"a":1, "b":2}' '{"b":1, "c":5}' | rpk --brokers ${KAFKA_ADDR} topic produce ${KAFKA_TOPIC}
       - apt-get update && apt-get install -y openjdk-17-jdk
       - java -version
       - apt-get update && apt-get install -y docker.io net-tools
       - docker --version
       - docker pull docker.io/apache/beam_java21_sdk:2.52.0
       - curl "https://repo.maven.apache.org/maven2/org/apache/beam/beam-sdks-java-io-expansion-service/2.52.0/${EXPANSION_FILE_NAME}" --output ${EXPANSION_FILE_NAME}
       - netstat -an | grep 8097 || echo "Port 8097 is available"
       - java -jar ${EXPANSION_FILE_NAME} 8097 &
       - sleep 5 #&& curl -v http://localhost:8097/status
       - netstat -an | grep 8097
       - python -m beam_streamer local --topic-name ${KAFKA_TOPIC}
   ```
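
The script above waits for the expansion service with a fixed `sleep 5` followed by `netstat`. A minimal sketch of an alternative, using only the Python standard library, is to poll the port until something accepts a TCP connection. The helper name `wait_for_port` and its default timeouts are assumptions for illustration. A successful connection only shows that a listener exists on that address as seen from the job container; it does not prove that the gRPC service is ready, or that the port is reachable from inside the SDK harness container.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False once
    `timeout` seconds have passed without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing is listening
            # or the attempt times out.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 8097)` before starting the pipeline would make the job fail fast when the expansion service never comes up, instead of depending on the five-second sleep being long enough.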
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

