Will-Lo commented on a change in pull request #3154: URL: https://github.com/apache/incubator-gobblin/pull/3154#discussion_r528402591
########## File path: gobblin-docs/user-guide/Docker-Integration.md ##########
@@ -18,68 +18,68 @@ The `gobblin/gobblin-wikipedia` repository contains images that run the Gobblin
 The `gobblin/gobblin-standalone` repository contains images that run a [Gobblin standalone service](Gobblin-Deployment#standalone-architecture) inside a Docker container. These images provide an easy and simple way to setup a Gobblin standalone service on any Docker compatible machine.
+The `gobblin/gobblin-service` repository contains images that run [Gobblin as a service](Building-Gobblin-as-a-Service#running-gobblin-as-a-service-with-docker), which is a service that takes in a user request (a logical flow) and converts it into a series of Gobblin Jobs, and monitors these jobs in a distributed manner.
+
 The `gobblin/gobblin-base` and `gobblin/gobblin-distributions` repositories are for internal use only, and are primarily useful for Gobblin developers.
-## Gobblin-Wikipedia Repository
+# Run Gobblin Standalone
+
+The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem.
All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-wikipedia/). These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the [Gobblin Wikipedia job](../Getting-Started). When a container is launched using the `gobblin-wikipedia` image, Gobblin starts up, runs the Wikipedia example, and then exits. +### Set working directory -Running the `gobblin-wikipedia` image requires taking following steps (lets assume we want to an Ubuntu based image): +Before running docker containers, set a working directory for Gobblin jobs: -* Download the images from the `gobblin/gobblin-wikipedia` repository +`export LOCAL_JOB_DIR=<local_gobblin_directory>` -``` -docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +We will use this directory as the [volume](https://docs.docker.com/storage/volumes/) for Gobblin jobs and outputs. Make sure your Docker has the [access](https://docs.docker.com/docker-for-mac/#file-sharing) to this folder. This is the prerequisite for all following example jobs. -* Run the `gobblin/gobblin-wikipedia:ubuntu-gobblin-latest` image in a Docker container +### Run the docker image with simple wikipedia jobs -``` -docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +Run these commands to start the docker image: -The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. 
When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step.
+`docker pull gobblin/gobblin-standalone:alpine-gaas-latest`

Review comment:
We should use the default `latest` tag here, so `docker pull gobblin/gobblin-standalone:latest`. We might also be migrating to the Apache Docker Hub sometime soon, but we can edit this doc later to reflect that.

########## File path: gobblin-docs/user-guide/Docker-Integration.md ##########
@@ -18,68 +18,68 @@ The `gobblin/gobblin-wikipedia` repository contains images that run the Gobblin
 The `gobblin/gobblin-standalone` repository contains images that run a [Gobblin standalone service](Gobblin-Deployment#standalone-architecture) inside a Docker container. These images provide an easy and simple way to setup a Gobblin standalone service on any Docker compatible machine.
+The `gobblin/gobblin-service` repository contains images that run [Gobblin as a service](Building-Gobblin-as-a-Service#running-gobblin-as-a-service-with-docker), which is a service that takes in a user request (a logical flow) and converts it into a series of Gobblin Jobs, and monitors these jobs in a distributed manner.
+
 The `gobblin/gobblin-base` and `gobblin/gobblin-distributions` repositories are for internal use only, and are primarily useful for Gobblin developers.
-## Gobblin-Wikipedia Repository
+# Run Gobblin Standalone
+
+The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem.
The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-wikipedia/). These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the [Gobblin Wikipedia job](../Getting-Started). When a container is launched using the `gobblin-wikipedia` image, Gobblin starts up, runs the Wikipedia example, and then exits. +### Set working directory -Running the `gobblin-wikipedia` image requires taking following steps (lets assume we want to an Ubuntu based image): +Before running docker containers, set a working directory for Gobblin jobs: -* Download the images from the `gobblin/gobblin-wikipedia` repository +`export LOCAL_JOB_DIR=<local_gobblin_directory>` -``` -docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +We will use this directory as the [volume](https://docs.docker.com/storage/volumes/) for Gobblin jobs and outputs. Make sure your Docker has the [access](https://docs.docker.com/docker-for-mac/#file-sharing) to this folder. This is the prerequisite for all following example jobs. 
-* Run the `gobblin/gobblin-wikipedia:ubuntu-gobblin-latest` image in a Docker container +### Run the docker image with simple wikipedia jobs -``` -docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +Run these commands to start the docker image: -The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step. +`docker pull gobblin/gobblin-standalone:alpine-gaas-latest` -* Preserving the output of a Docker container requires using a [data volume](https://docs.docker.com/engine/tutorials/dockervolumes/). To do this, run the below command: +`docker run gobblin/gobblin-standalone:alpine-gaas-latest` -``` -docker run -v /home/gobblin/work-dir:/home/gobblin/work-dir gobblin-wikipedia -``` +After the container spins up, put the [wikipedia.pull](https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull) in ${LOCAL_JOB_DIR}. You will see the Gobblin daemon pick up the job, and the result output is in ${LOCAL_JOB_DIR}/job-output/. -The output of the Gobblin-Wikipedia job should now be written to `/home/gobblin/work-dir/job-output`. The `-v` command in Docker uses a feature of Docker called [data volumes](https://docs.docker.com/engine/tutorials/dockervolumes/). The `-v` option mounts a host directory into a container and is of the form `[host-directory]:[container-directory]`. Now any modifications to the host directory can be seen inside the container-directory, and any modifications to the container-directory can be seen inside the host-directory. This is a standard way to ensure data persists even after a Docker container finishes. 
It's important to note that the `[host-directory]` in the `-v` option can be changed to any directory (on OSX it must be under the `/Users/` directory), but the `[container-directory]` must remain `/home/gobblin/work-dir` (at least for now). +This example job is correspondent to the [getting started guide](https://gobblin.readthedocs.io/en/latest/Getting-Started/). With the docker image, you can focus on the Gobblin functionalities, avoiding the hassle of building a distribution. -## Gobblin-Standalone Repository +### Use Gobblin Standalone on Docker for Kafka and HDFS Ingestion -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. +* To ingest from/to Kafka and HDFS by Gobblin, you need to start services for Zookeeper, Kafka and HDFS along with Gobblin. We use docker [compose](https://docs.docker.com/compose/) with images contributed to docker hub. 
Firstly, you need to create a [docker-compose.yml](https://github.com/apache/incubator-gobblin/blob/master/gobblin-docker/gobblin-recipes/kafka-hdfs/docker-compose.yml) file(copy the content into your .yml file).

Review comment:
I don't think we should ask users to copy the content into a local docker-compose file if they just want to run some examples or try it out. We can just link directly to the docker compose recipe here.

########## File path: gobblin-docs/user-guide/Docker-Integration.md ##########
@@ -18,68 +18,68 @@ The `gobblin/gobblin-wikipedia` repository contains images that run the Gobblin
 The `gobblin/gobblin-standalone` repository contains images that run a [Gobblin standalone service](Gobblin-Deployment#standalone-architecture) inside a Docker container. These images provide an easy and simple way to setup a Gobblin standalone service on any Docker compatible machine.
+The `gobblin/gobblin-service` repository contains images that run [Gobblin as a service](Building-Gobblin-as-a-Service#running-gobblin-as-a-service-with-docker), which is a service that takes in a user request (a logical flow) and converts it into a series of Gobblin Jobs, and monitors these jobs in a distributed manner.
+
 The `gobblin/gobblin-base` and `gobblin/gobblin-distributions` repositories are for internal use only, and are primarily useful for Gobblin developers.
-## Gobblin-Wikipedia Repository
+# Run Gobblin Standalone
+
+The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem.
The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-wikipedia/). These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the [Gobblin Wikipedia job](../Getting-Started). When a container is launched using the `gobblin-wikipedia` image, Gobblin starts up, runs the Wikipedia example, and then exits. +### Set working directory -Running the `gobblin-wikipedia` image requires taking following steps (lets assume we want to an Ubuntu based image): +Before running docker containers, set a working directory for Gobblin jobs: -* Download the images from the `gobblin/gobblin-wikipedia` repository +`export LOCAL_JOB_DIR=<local_gobblin_directory>` -``` -docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +We will use this directory as the [volume](https://docs.docker.com/storage/volumes/) for Gobblin jobs and outputs. Make sure your Docker has the [access](https://docs.docker.com/docker-for-mac/#file-sharing) to this folder. This is the prerequisite for all following example jobs. 
-* Run the `gobblin/gobblin-wikipedia:ubuntu-gobblin-latest` image in a Docker container
+### Run the docker image with simple wikipedia jobs
-```
-docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest
-```
+Run these commands to start the docker image:
-The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step.
+`docker pull gobblin/gobblin-standalone:alpine-gaas-latest`
-* Preserving the output of a Docker container requires using a [data volume](https://docs.docker.com/engine/tutorials/dockervolumes/). To do this, run the below command:
+`docker run gobblin/gobblin-standalone:alpine-gaas-latest`

Review comment:
This wouldn't pick up any jobs unless we expose `LOCAL_JOB_DIR` as a docker volume mounted at `/tmp/gobblin-standalone/jobs`, which is where gobblin standalone picks up its jobs. This is usually not a problem in docker-compose, where we can define volumes, but if we want to run the docker image directly it would look something like this: `docker run -v $LOCAL_JOB_DIR:/tmp/gobblin-standalone/jobs gobblin/gobblin-standalone:latest`.

Also, when I tried running the wikipedia example, it gave me this error:
```
Thread Thread[JobScheduler-0,5,main] threw an uncaught exception: java.lang.VerifyError: Stack map does not match the one at exception handler 77
```
Did we confirm that the example is still working?

########## File path: gobblin-docs/user-guide/Docker-Integration.md ##########
@@ -18,68 +18,68 @@ The `gobblin/gobblin-wikipedia` repository contains images that run the Gobblin
 The `gobblin/gobblin-standalone` repository contains images that run a [Gobblin standalone service](Gobblin-Deployment#standalone-architecture) inside a Docker container.
These images provide an easy and simple way to setup a Gobblin standalone service on any Docker compatible machine. +The `gobblin/gobblin-service` repository contains images that run [Gobblin as a service](Building-Gobblin-as-a-Service#running-gobblin-as-a-service-with-docker), which is a service that takes in a user request (a logical flow) and converts it into a series of Gobblin Jobs, and monitors these jobs in a distributed manner. + The `gobblin/gobblin-base` and `gobblin/gobblin-distributions` repositories are for internal use only, and are primarily useful for Gobblin developers. -## Gobblin-Wikipedia Repository +# Run Gobblin Standalone + +The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-wikipedia/). 
These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the [Gobblin Wikipedia job](../Getting-Started). When a container is launched using the `gobblin-wikipedia` image, Gobblin starts up, runs the Wikipedia example, and then exits. +### Set working directory -Running the `gobblin-wikipedia` image requires taking following steps (lets assume we want to an Ubuntu based image): +Before running docker containers, set a working directory for Gobblin jobs: -* Download the images from the `gobblin/gobblin-wikipedia` repository +`export LOCAL_JOB_DIR=<local_gobblin_directory>` -``` -docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +We will use this directory as the [volume](https://docs.docker.com/storage/volumes/) for Gobblin jobs and outputs. Make sure your Docker has the [access](https://docs.docker.com/docker-for-mac/#file-sharing) to this folder. This is the prerequisite for all following example jobs. -* Run the `gobblin/gobblin-wikipedia:ubuntu-gobblin-latest` image in a Docker container +### Run the docker image with simple wikipedia jobs -``` -docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +Run these commands to start the docker image: -The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step. +`docker pull gobblin/gobblin-standalone:alpine-gaas-latest` -* Preserving the output of a Docker container requires using a [data volume](https://docs.docker.com/engine/tutorials/dockervolumes/). 
To do this, run the below command: +`docker run gobblin/gobblin-standalone:alpine-gaas-latest` -``` -docker run -v /home/gobblin/work-dir:/home/gobblin/work-dir gobblin-wikipedia -``` +After the container spins up, put the [wikipedia.pull](https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull) in ${LOCAL_JOB_DIR}. You will see the Gobblin daemon pick up the job, and the result output is in ${LOCAL_JOB_DIR}/job-output/. -The output of the Gobblin-Wikipedia job should now be written to `/home/gobblin/work-dir/job-output`. The `-v` command in Docker uses a feature of Docker called [data volumes](https://docs.docker.com/engine/tutorials/dockervolumes/). The `-v` option mounts a host directory into a container and is of the form `[host-directory]:[container-directory]`. Now any modifications to the host directory can be seen inside the container-directory, and any modifications to the container-directory can be seen inside the host-directory. This is a standard way to ensure data persists even after a Docker container finishes. It's important to note that the `[host-directory]` in the `-v` option can be changed to any directory (on OSX it must be under the `/Users/` directory), but the `[container-directory]` must remain `/home/gobblin/work-dir` (at least for now). +This example job is correspondent to the [getting started guide](https://gobblin.readthedocs.io/en/latest/Getting-Started/). With the docker image, you can focus on the Gobblin functionalities, avoiding the hassle of building a distribution. -## Gobblin-Standalone Repository +### Use Gobblin Standalone on Docker for Kafka and HDFS Ingestion -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. 
The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service.
+* To ingest from/to Kafka and HDFS by Gobblin, you need to start services for Zookeeper, Kafka and HDFS along with Gobblin. We use docker [compose](https://docs.docker.com/compose/) with images contributed to docker hub. Firstly, you need to create a [docker-compose.yml](https://github.com/apache/incubator-gobblin/blob/master/gobblin-docker/gobblin-recipes/kafka-hdfs/docker-compose.yml) file(copy the content into your .yml file).

Review comment:
The link `https://github.com/apache/incubator-gobblin/gobblin-docker/gobblin-recipes/kafka-hdfs/docker-compose.yml` is broken. Did you mean `https://github.com/apache/incubator-gobblin/gobblin-docker/gobblin-recipes/kafka-hdfs/docker-compose.yml`?

########## File path: gobblin-docs/user-guide/Docker-Integration.md ##########
@@ -18,68 +18,68 @@ The `gobblin/gobblin-wikipedia` repository contains images that run the Gobblin
 The `gobblin/gobblin-standalone` repository contains images that run a [Gobblin standalone service](Gobblin-Deployment#standalone-architecture) inside a Docker container. These images provide an easy and simple way to setup a Gobblin standalone service on any Docker compatible machine.
+The `gobblin/gobblin-service` repository contains images that run [Gobblin as a service](Building-Gobblin-as-a-Service#running-gobblin-as-a-service-with-docker), which is a service that takes in a user request (a logical flow) and converts it into a series of Gobblin Jobs, and monitors these jobs in a distributed manner. + The `gobblin/gobblin-base` and `gobblin/gobblin-distributions` repositories are for internal use only, and are primarily useful for Gobblin developers. -## Gobblin-Wikipedia Repository +# Run Gobblin Standalone + +The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service. -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-wikipedia/). 
These images are mainly meant to act as a "Hello World" example for the Gobblin-Docker integration, and to provide a sanity check to see if the Gobblin-Docker integration is working on a given machine. The image contains the Gobblin configuration files to run the [Gobblin Wikipedia job](../Getting-Started). When a container is launched using the `gobblin-wikipedia` image, Gobblin starts up, runs the Wikipedia example, and then exits. +### Set working directory -Running the `gobblin-wikipedia` image requires taking following steps (lets assume we want to an Ubuntu based image): +Before running docker containers, set a working directory for Gobblin jobs: -* Download the images from the `gobblin/gobblin-wikipedia` repository +`export LOCAL_JOB_DIR=<local_gobblin_directory>` -``` -docker pull gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +We will use this directory as the [volume](https://docs.docker.com/storage/volumes/) for Gobblin jobs and outputs. Make sure your Docker has the [access](https://docs.docker.com/docker-for-mac/#file-sharing) to this folder. This is the prerequisite for all following example jobs. -* Run the `gobblin/gobblin-wikipedia:ubuntu-gobblin-latest` image in a Docker container +### Run the docker image with simple wikipedia jobs -``` -docker run gobblin/gobblin-wikipedia:ubuntu-gobblin-latest -``` +Run these commands to start the docker image: -The logs are printed to the console, and no errors should pop up. This should provide a nice sanity check to ensure that everything is working as expected. The output of the job will be written to a directory inside the container. When the container exits that data will be lost. In order to preserve the output of the job, continue to the next step. +`docker pull gobblin/gobblin-standalone:alpine-gaas-latest` -* Preserving the output of a Docker container requires using a [data volume](https://docs.docker.com/engine/tutorials/dockervolumes/). 
To do this, run the below command: +`docker run gobblin/gobblin-standalone:alpine-gaas-latest` -``` -docker run -v /home/gobblin/work-dir:/home/gobblin/work-dir gobblin-wikipedia -``` +After the container spins up, put the [wikipedia.pull](https://github.com/apache/incubator-gobblin/blob/master/gobblin-example/src/main/resources/wikipedia.pull) in ${LOCAL_JOB_DIR}. You will see the Gobblin daemon pick up the job, and the result output is in ${LOCAL_JOB_DIR}/job-output/. -The output of the Gobblin-Wikipedia job should now be written to `/home/gobblin/work-dir/job-output`. The `-v` command in Docker uses a feature of Docker called [data volumes](https://docs.docker.com/engine/tutorials/dockervolumes/). The `-v` option mounts a host directory into a container and is of the form `[host-directory]:[container-directory]`. Now any modifications to the host directory can be seen inside the container-directory, and any modifications to the container-directory can be seen inside the host-directory. This is a standard way to ensure data persists even after a Docker container finishes. It's important to note that the `[host-directory]` in the `-v` option can be changed to any directory (on OSX it must be under the `/Users/` directory), but the `[container-directory]` must remain `/home/gobblin/work-dir` (at least for now). +This example job is correspondent to the [getting started guide](https://gobblin.readthedocs.io/en/latest/Getting-Started/). With the docker image, you can focus on the Gobblin functionalities, avoiding the hassle of building a distribution. -## Gobblin-Standalone Repository +### Use Gobblin Standalone on Docker for Kafka and HDFS Ingestion -The Docker images for this repository can be found on Docker Hub [here](https://hub.docker.com/r/gobblin/gobblin-standalone/). These images run a Gobblin standalone service inside a Docker container. The Gobblin standalone service is a long running process that can run Gobblin jobs defined in a `.job` or `.pull` file. 
The job / pull files are submitted to the standalone service by placing them in a directory on the local filesystem. The standalone service monitors this directory for any new job / pull files and runs them either immediately or on a scheduled basis (more information on how this works can be found [here](Working-with-Job-Configuration-Files#adding-or-changing-job-configuration-files)). Running the Gobblin standalone service inside a Docker container allows Gobblin to pick up job / pull files from a directory on the host filesystem, run the job, and write the output back the host filesystem. All the heavy lifting is done inside a Docker container, the user just needs to worry about defining and submitting job / pull files. The goal is to provide a easy to setup environment for the Gobblin standalone service.
+* To ingest from/to Kafka and HDFS by Gobblin, you need to start services for Zookeeper, Kafka and HDFS along with Gobblin. We use docker [compose](https://docs.docker.com/compose/) with images contributed to docker hub. Firstly, you need to create a [docker-compose.yml](https://github.com/apache/incubator-gobblin/blob/master/gobblin-docker/gobblin-recipes/kafka-hdfs/docker-compose.yml) file(copy the content into your .yml file).
+
+* Second, in the same folder of the yml file, create a [hadoop.env](https://github.com/apache/incubator-gobblin/blob/master/gobblin-docker/gobblin-recipes/kafka-hdfs/hadoop.env) file to specify all HDFS related config(copy the content into your .env file).
+
+* Open a terminal in the same folder, pull and run these docker services:
+
+ `docker-compose -f ./docker-compose.yml pull`

Review comment:
The docker commands aren't working for me. I run into the following errors:
```
Unsupported config option for services.kafka: 'datanode1'
Unsupported config option for services.gobblin-standalone: 'zookeeper'
```
Do you know what it could be?
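
For reference, the volume-mount invocation suggested in the earlier comment on `docker run` could be sketched roughly as below. This is only an illustration, not a verified fix: the container path `/tmp/gobblin-standalone/jobs` and the `latest` tag are taken from the review comments, while the `build_run_cmd` helper and the `$HOME/gobblin-jobs` fallback directory are hypothetical names introduced here. The script prints the command instead of running it, so it can be inspected before use.

```shell
#!/bin/sh
# Sketch only: prints the docker command rather than executing it.

# Host directory holding .job / .pull files.
# The fallback value is a hypothetical default, not something the docs define.
LOCAL_JOB_DIR="${LOCAL_JOB_DIR:-$HOME/gobblin-jobs}"

# Container path where, per the review comment, gobblin standalone picks up jobs.
CONTAINER_JOB_DIR="/tmp/gobblin-standalone/jobs"

# Image tag suggested in the review (assumed to be published on Docker Hub).
IMAGE="gobblin/gobblin-standalone:latest"

# Hypothetical helper: builds the `docker run` line for a given host directory.
# The volume argument is quoted so host paths containing spaces stay intact.
build_run_cmd() {
  printf 'docker run -v "%s:%s" %s\n' "$1" "$CONTAINER_JOB_DIR" "$IMAGE"
}

build_run_cmd "$LOCAL_JOB_DIR"
```

Once the printed line looks right, it can be pasted into a terminal; a job file copied into `$LOCAL_JOB_DIR` on the host should then be visible inside the container at the mounted path.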
