MelihErduran opened a new pull request, #5079:
URL: https://github.com/apache/texera/pull/5079

   # PR: One-command local dev orchestrator for Texera (`bin/texera`)
   
   ## Summary
   
   Adds `bin/texera` — a single Bash CLI that replaces the previous "open 7 
IntelliJ run configs in the right order, then `yarn start` in a different 
terminal" workflow with one command:
   
   ```
   texera start
   ```
   
   It launches Postgres + LakeFS/MinIO + every backend JVM service + 
agent-service + frontend, in the right order, with prefixed log streams in one 
terminal, a live bottom-pinned health bar, and clean teardown on Ctrl+C. Also 
ships subcommands for `setup`, `build`, `stop`, `status`, and `logs`. 
A complementary helper, `bin/check-services.sh`, provides a one-shot probe 
outside an active session.
   
   This is a dev-tool addition — nothing about the services themselves changes. 
The existing `.run/*.xml` IntelliJ configs and `bin/single-node` Docker deploy 
paths are untouched.
   
   ## Motivation
   
   Before this PR, getting Texera running locally required:
   
   - Knowing the launch order (master before worker, infra before JVMs).
   - Knowing the seven `bin/*-service.sh` scripts plus the un-scripted 
agent-service and frontend.
   - Eyeballing seven different terminals to figure out whether the stack was 
actually up.
   - Manually `pkill`ing JVMs when something crashed, because there was no 
cleanup story.
   - Hitting `file-service crashed on boot` ~50% of the time when LakeFS wasn't 
quite ready.
   
   New contributors hit all of this on day one. Existing contributors lived 
with it but lost ~5 minutes per restart.
   
   ## What's in this PR
   
   ```
   bin/texera                  new   one-command orchestrator (~875 lines)
   bin/check-services.sh       new   standalone health probe (~118 lines)
   bin/build-services.sh       mod   add access-control-service to dist+unzip;
                                     rename amber zip target (texera-*.zip → amber-*.zip)
   ```
   
   `bin/texera` is the main feature; the other two are small.
   
   ## Subcommands
   
   ```
   texera setup           One-time: toolchain check + sbt build + frontend/python deps + SQL DDLs
   texera build           Re-build staged backend binaries (after backend code changes)
   texera start [mode]    Start services. Interactive menu if no mode given.
   texera stop            Stop everything started by `texera start`.
   texera status          Per-service port reachability table.
   texera logs <service>  Tail one service's log file.
   ```
   
   ### `texera setup`
   
   Idempotent first-time bootstrap. Verifies the toolchain (java 17, sbt, node 
24, yarn, docker, pg_isready, psql, curl, unzip), runs `bin/build-services.sh`, 
installs frontend (`yarn install`) and agent-service (`bun install`) deps, and 
applies `sql/texera_ddl.sql` and `sql/iceberg_postgres_catalog.sql`. Skips 
agent-service gracefully if `bun` isn't installed.
   
   ### `texera build`
   
   Delegates to `bin/build-services.sh` (`sbt clean dist` + unzip each 
service's stage). Same path the deploy scripts use.
   
   ### `texera start`
   
   Five modes, chosen by argument or interactive menu:
   
   | Mode | Postgres + LakeFS/MinIO | Backend JVM services + agent | Frontend |
   |---|---|---|---|
   | `full` | ✓ | ✓ | ✓ |
   | `backend` | ✓ | ✓ | — |
   | `frontend` | — | — | ✓ |
   | `infra` | ✓ | — | — |
   | `services` | — | ✓ | ✓ |
   
   The interactive menu (`texera start` with no arg, TTY only) renders a 
box-drawn numbered prompt; `q` quits. If stdin is not a TTY and no mode is 
given, `start` exits with an error listing the valid modes, so it is CI-safe.
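
The mode table and the non-TTY rule above can be sketched roughly as follows. `mode_components` and `resolve_mode` are hypothetical names, and the plain `read` prompt is only a stand-in for the box-drawn menu; this is an illustration, not the code in `bin/texera`.

```bash
#!/usr/bin/env bash
# Hypothetical helper names; the real script's functions may differ.
VALID_MODES="full backend frontend infra services"

# Map a mode to the component groups it starts (mirrors the mode table).
mode_components() {
  case "$1" in
    full)     echo "infra backend frontend" ;;
    backend)  echo "infra backend" ;;
    frontend) echo "frontend" ;;
    infra)    echo "infra" ;;
    services) echo "backend frontend" ;;
    *)        return 1 ;;
  esac
}

# Resolve the mode from an optional argument. Without an argument, prompt
# only when stdin is a terminal; otherwise fail and list the valid modes.
resolve_mode() {
  local mode=${1:-}
  if [ -z "$mode" ]; then
    if [ -t 0 ]; then
      read -r -p "mode ($VALID_MODES, q to quit): " mode
      if [ "$mode" = "q" ]; then return 2; fi
    else
      echo "error: no mode given; valid modes: $VALID_MODES" >&2
      return 1
    fi
  fi
  if ! mode_components "$mode" >/dev/null; then
    echo "error: unknown mode '$mode'; valid modes: $VALID_MODES" >&2
    return 1
  fi
  echo "$mode"
}
```

Under these assumptions, `resolve_mode "" </dev/null` fails with the list of modes instead of blocking on a prompt, which is the property a CI job needs.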
   
   Service registry is a single declarative table inside the script:
   
   ```bash
   SERVICES=(
     "config|.|target/config-service-*/bin/config-service|9094"
     "compile|.|target/workflow-compiling-service-*/bin/workflow-compiling-service|9090"
     "file|.|target/file-service-*/bin/file-service|9092"
     "managing|.|target/computing-unit-managing-service-*/bin/computing-unit-managing-service|8888"
     "access|.|target/access-control-service-*/bin/access-control-service|9096"
     "master|amber|target/amber-*/bin/computing-unit-master|8085"
     "worker|amber|target/amber-*/bin/computing-unit-worker|-"
     "web|amber|target/amber-*/bin/texera-web-application|8080"
   )
   ```
   
   Adding a service later means adding one row.
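
For readers unfamiliar with pipe-delimited rows, here is a minimal sketch of how one row could be split into fields. The field names (`SVC_NAME`, `SVC_DIR`, `SVC_GLOB`, `SVC_PORT`) and the reading of `-` as "no HTTP port" are assumptions inferred from the table, not taken from the script.

```bash
#!/usr/bin/env bash
# Two rows copied from the registry; the rest are omitted for brevity.
SERVICES=(
  "config|.|target/config-service-*/bin/config-service|9094"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
)

# Split one row on '|' into four globals. Field names are this sketch's guess.
parse_service_row() {
  local IFS='|'
  read -r SVC_NAME SVC_DIR SVC_GLOB SVC_PORT <<< "$1"
}

# Walk the registry in declaration order and print name plus port.
list_services() {
  local row
  for row in "${SERVICES[@]}"; do
    parse_service_row "$row"
    if [ "$SVC_PORT" = "-" ]; then
      printf '%-8s (no port to probe)\n' "$SVC_NAME"
    else
      printf '%-8s :%s\n' "$SVC_NAME" "$SVC_PORT"
    fi
  done
}
```

Because the here-string is quoted, the `*` in the binary glob is kept literally at parse time; it would only be expanded later, when the launcher resolves the staged binary path.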
   
   Each row spawns from its sbt-native-packager staged binary (not `sbt 
runMain`) — that avoids the sbt boot-lock contention you get from launching 
several `sbt` processes in parallel and skips sbt startup overhead per service.
   
   Each service's stdout/stderr is piped through a colored prefixer:
   
   ```
   [config]   INFO  ConfigService starting…
   [compile]  INFO  WorkflowCompilingService starting…
   [file]     ERROR Failed to connect to lake fs server: …
   [master]   [INFO] [ClusterListener] received member event = MemberUp(...)
   ```
   
   Color is a stable hash of the service name → ANSI palette. Stream prefixer 
is `awk -v p="$prefix" '{ print p, $0; fflush(); }'`.
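
A rough sketch of the prefixer together with a stable name-to-color mapping. The `awk` line is the one quoted above; the hash (`cksum`) and the six-color palette are assumptions, since the description only says "stable hash", and `color_for` / `colored_prefix` are hypothetical names.

```bash
#!/usr/bin/env bash
PALETTE=(31 32 33 34 35 36)   # ANSI foreground codes; this palette is illustrative

# Stable name -> color: hash the name with cksum (POSIX CRC) and index the palette.
color_for() {
  local sum idx
  sum=$(printf '%s' "$1" | cksum)
  sum=${sum%% *}                          # keep the CRC, drop the byte count
  idx=$(( sum % ${#PALETTE[@]} ))
  echo "${PALETTE[$idx]}"
}

# Build "[name]" wrapped in the color escape for that name.
colored_prefix() {
  printf '\033[%sm[%s]\033[0m' "$(color_for "$1")" "$1"
}

# Prefix every stdin line; fflush() pushes each line through the pipe immediately.
prefix_stream() {
  local prefix=$1
  awk -v p="$prefix" '{ print p, $0; fflush(); }'
}
```

Usage would look like `some-service 2>&1 | prefix_stream "$(colored_prefix file)"`. The same name always maps to the same color across runs because the CRC depends only on the name.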
   
   Per-service logs are also written to `logs/texera-dev/<name>.log` (so 
`texera logs <name>` works mid-run).
   
   ### `texera stop`
   
   `stop` kills every service launched by `texera start`, then `docker compose 
down`s the LakeFS/MinIO stack.
   
   The kill path matters because the previous scripts left orphan JVMs — see 
the *Hard problems* section below.
   
   ### `texera status` and `texera logs`
   
   `status` makes one `curl /api/healthcheck` per service and renders an 
aligned table (`up`/`down`). Independent of any active `texera start`. Useful 
for checking dev state from a different shell.
   
   `logs <name>` is `tail -F logs/texera-dev/<name>.log`. Names come from the 
same registry.
   
   ## Hard problems and how they're solved
   
   ### 1. Per-service liveness while logs scroll past
   
   Spawning seven JVM services into one terminal means thousands of lines of 
log spam during a normal boot. The user can't tell from the stream which 
services are up.
   
   **Solution: persistent bottom-pinned status bar.** When stdout is a TTY, 
`status_bar_init` sets the terminal scroll region via DECSTBM 
(`ESC[1;<LINES-3>r`), reserving the bottom 3 rows. A background poller redraws 
those rows every 2 s:
   
   ```
   ═════════════════════════════════════════════════════════════
    ✓ ALL 9 SERVICES UP  (47s elapsed)
   ═════════════════════════════════════════════════════════════
   ```
   
   or on failure:
   
   ```
   ═════════════════════════════════════════════════════════════
    ✗ 2/9 DOWN: master✗ file…  (12s)
   ═════════════════════════════════════════════════════════════
   ```
   
   Symbols: `✗` = pipeline collapsed (JVM exited), `…` = process alive but port 
not yet bound.
   
   The whole 3-row redraw is one `printf` with `DECSC`/`DECRC` (save/restore 
cursor) around it, which makes byte-level interleaving between the poller and 
concurrent log writes from the spawned services unlikely. Worst case is one 
garbled frame, which self-heals on the next 2 s tick.
   
   When stdout *isn't* a TTY (CI, `| tee log.txt`, etc), `status_bar_supported` 
returns false and the code falls back to a one-shot wait + trailing banner.
   
   Teardown (`status_bar_teardown`) is wired in two places: `shutdown` (the 
Ctrl+C trap) calls it before printing anything else, so "shutting down…" lands 
in normal layout, and `trap status_bar_teardown EXIT` is a belt-and-suspenders 
safety net so the terminal is never left with a stuck scroll region, even on an 
unexpected exit.
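
The escape sequences involved can be sketched as below. Function names are hypothetical and the terminal height is passed in explicitly; the real `status_bar_init` and its poller may be structured differently. The sequences themselves are standard: `ESC 7` / `ESC 8` are DECSC/DECRC, `CSI top;bottom r` is DECSTBM, `CSI row;col H` moves the cursor, and `CSI 2K` erases the line.

```bash
#!/usr/bin/env bash
ESC=$'\033'

# DECSTBM: restrict scrolling to rows 1..(lines-3), reserving the bottom 3 rows.
scroll_region_seq() { printf '%s[1;%dr' "$ESC" "$(( $1 - 3 ))"; }

# DECSTBM with no parameters resets the scroll region to the full screen.
scroll_region_reset_seq() { printf '%s[r' "$ESC"; }

# Build the whole 3-row frame as one string and emit it with a single printf:
# save cursor, clear and paint each reserved row, restore cursor.
status_bar_frame() {
  local lines=$1 msg=$2 rule='═════════════════════'
  local frame="${ESC}7"
  frame+="${ESC}[$(( lines - 2 ));1H${ESC}[2K${rule}"
  frame+="${ESC}[$(( lines - 1 ));1H${ESC}[2K ${msg}"
  frame+="${ESC}[${lines};1H${ESC}[2K${rule}"
  frame+="${ESC}8"
  printf '%s' "$frame"
}
```

A poller under these assumptions would call `status_bar_frame "$(tput lines)" "$message"` on every tick, and teardown would emit `scroll_region_reset_seq` once.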
   
   ### 2. file-service vs LakeFS startup race
   
   `file-service` calls `LakeFSStorageClient.healthCheck()` during boot 
(`file-service/src/main/scala/.../FileService.scala:77`). If LakeFS isn't 
accepting HTTP, the JVM exits.
   
   `docker compose up -d` returns when the container is up, not when LakeFS's 
HTTP server is accepting connections — a 5–15 s gap. So file-service crashed 
~50% of the time on cold starts.
   
   **Solution:** `start_lakefs` now polls `http://localhost:8000/_health` 
(falling back to `/`) for up to 60 s after `docker compose up -d`, and only 
returns once LakeFS answers. Both endpoints verified to return 200 against the 
running container.
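
A minimal sketch of that readiness wait, assuming a generic retry helper (`wait_until`, a hypothetical name) wrapped around a `curl` probe of the two endpoints named above. The 2 s per-request timeout and the 1 s polling interval are illustrative choices, not values from the script.

```bash
#!/usr/bin/env bash
# Retry a command once per second until it succeeds or <timeout> seconds pass.
wait_until() {
  local timeout=$1
  shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Succeeds once LakeFS answers on /_health, falling back to /.
lakefs_ready() {
  curl -fsS -o /dev/null --max-time 2 "http://localhost:8000/_health" 2>/dev/null ||
    curl -fsS -o /dev/null --max-time 2 "http://localhost:8000/" 2>/dev/null
}
```

With these helpers, the step after `docker compose up -d` would be `wait_until 60 lakefs_ready || exit 1`, so file-service is only spawned once LakeFS accepts HTTP.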
   
   ### 3. Orphan JVMs holding ports after stop
   
   The previous `bin/*-service.sh` launchers and earlier iterations of 
`bin/texera` recorded the wrong PID. The pipeline `( exec binary ) | tee log | 
prefix_stream` ends up with `$!` = the awk PID. Killing awk does not propagate 
to the JVM, which is a sibling, not a child. The fallback `pkill -f <basename>` 
didn't help either, because the launcher script's filename 
(`computing-unit-master` etc) isn't in the JVM's command line after `exec java 
-cp …`.
   
   Result: every `texera start` after the first failed with `BindException: 
127.0.0.1:2552 Address already in use`, and you'd have to `lsof -ti :2552 | 
xargs kill -9` manually.
   
   **Solution: process groups.**
   
   - Each `spawn_*` now toggles `set -m` briefly so the backgrounded pipeline 
gets its own process group. With job control on, the PGID equals the PID of the 
pipeline leader, which is the JVM subshell.
   - `pgid_of_pipeline` reads it via `ps -o pgid= -p $!` and stores it in the 
pidfile (so the pidfile effectively holds the JVM's PID, not awk's).
   - `kill_all_pgids <grace>` does `kill -TERM -- -PGID` per recorded group → 
SIGTERMs JVM + tee + awk together. Sleeps `grace` seconds. Then `kill -KILL -- 
-PGID` on any group still alive. Used by both `shutdown` (Ctrl+C, 2 s grace) 
and `stop` (subcommand, 3 s grace).
   - For JVMs left over from before this PR existed (no pidfile to consult), 
`stop` also `pkill -f <mainclass>`s each known Java mainclass:
     - `org.apache.texera.web.{ComputingUnitMaster, ComputingUnitWorker, 
TexeraWebApplication}`
     - `org.apache.texera.service.{ConfigService, FileService, 
AccessControlService, ComputingUnitManagingService, WorkflowCompilingService}`
     - List verified against `META-INF/MANIFEST.MF` in every built jar and the 
`app_mainclass=` declarations in the amber launcher scripts.
   
   Free side benefit: `is_spawn_alive` now `kill -0 PGID`s, which directly 
probes the JVM leader rather than using awk's liveness as a proxy. The status 
bar's "crashed" detection is precise.
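
The process-group round trip can be reproduced with a synthetic pipeline. This sketch differs from the description in one respect: instead of `ps -o pgid= -p $!`, it reads the group leader's PID with `jobs -p`, which bash documents as printing only the process ID of the job's process group leader. Function names are hypothetical and `cat` stands in for `tee` plus the prefixer.

```bash
#!/usr/bin/env bash
# Run "$@" at the head of a pipeline in its own process group; record the PGID.
spawn_in_group() {
  set -m                            # job control on: the next background job gets its own group
  "$@" 2>&1 | cat >/dev/null &      # cat is a stand-in for tee + prefix_stream
  LAST_PGID=$(jobs -p %%)           # leader PID of the newest job, equal to its PGID
  set +m
}

# TERM the whole group, wait out the grace period, then KILL any stragglers.
kill_group() {
  local pgid=$1 grace=$2
  kill -TERM -- "-$pgid" 2>/dev/null || return 0   # group already gone
  sleep "$grace"
  kill -KILL -- "-$pgid" 2>/dev/null || true
}

# Probe the group leader (the first process of the pipeline) directly.
leader_alive() { kill -0 "$1" 2>/dev/null; }
```

For example, `spawn_in_group sleep 30` followed by `kill_group "$LAST_PGID" 1` signals both `sleep` and `cat`, whereas killing `$!` alone would only have reached `cat`.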
   
   ### 4. Ordering constraints
   
   `ComputingUnitMaster` must bind its Pekko/Akka cluster port before 
`computing-unit-worker` tries to join. Encoded as one `sleep 4` after spawning 
the master row. The launch loop walks `SERVICES` in declaration order, so the 
table itself is the canonical ordering.
   
   LakeFS comes before all JVM spawns because file-service depends on it; 
Postgres comes before LakeFS because LakeFS uses it.
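
The launch loop described above might look roughly like this. `spawn_service` is a placeholder and `MASTER_SETTLE_SECS` is a hypothetical knob added so the pause is visible; the PR itself hard-codes `sleep 4`.

```bash
#!/usr/bin/env bash
# Two rows copied from the registry; declaration order is launch order.
SERVICES=(
  "master|amber|target/amber-*/bin/computing-unit-master|8085"
  "worker|amber|target/amber-*/bin/computing-unit-worker|-"
)
MASTER_SETTLE_SECS=${MASTER_SETTLE_SECS:-4}   # hypothetical; the PR uses a fixed 4 s

# Placeholder: the real script would launch the staged binary here.
spawn_service() { echo "spawning $1"; }

launch_all() {
  local row name
  for row in "${SERVICES[@]}"; do
    name=${row%%|*}                 # first field of the row
    spawn_service "$name"
    # Give the master time to bind its cluster port before the worker joins.
    if [ "$name" = "master" ]; then
      sleep "$MASTER_SETTLE_SECS"
    fi
  done
}
```

A fixed sleep is the simplest encoding of the constraint; polling the master's port would be tighter, but that is outside what this PR describes.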
   
   ## File-by-file
   
   - **`bin/texera`** — entire orchestrator. Sections: service registry, mode 
table, color/printing, tool checks, infra (`ensure_postgres`, `start_lakefs`, 
`stop_lakefs`), stream prefixer, spawns, `kill_all_pgids` + `shutdown` trap, 
`status`/`logs` subcommands, `setup`/`build`, mode lookup + interactive menu, 
readiness probes (`probe_port`, `is_spawn_alive`, `wait_for_services`, 
`print_readiness_banner`), status bar, `start`, `stop`, dispatch.
   
   - **`bin/check-services.sh`** — standalone one-shot probe of every service's 
HTTP port. Independent of `texera start` session state. Prints a per-service 
table + a green/red trailing banner, exits non-zero on any failure. Useful from 
a second shell or in CI.
   
   - **`bin/build-services.sh`** — minor: adds the `access-control-service` 
unzip step that was missing, and renames the amber zip target from 
`texera-*.zip` to `amber-*.zip` to match the new artifact name.
   
   ## What's intentionally not in scope
   
   - The IntelliJ `.run/*.xml` configs still work; they're the path for 
breakpoint debugging. `texera start` is for "I want everything running, fast."
   - The `bin/single-node` Docker Compose deploy isn't touched.
   - No CI hookup added. `texera start backend` works in non-TTY mode (banner 
fallback), but no GitHub Actions job exercises it.
   
   ## Configuration knobs
   
   - `TEXERA_READY_TIMEOUT` (default 90) — seconds the one-shot non-TTY 
readiness check waits before giving up. The persistent bar polls forever; this 
only applies to the fallback path.
   - `TEXERA_HOST` (default `localhost`, used by `check-services.sh`) — host to 
probe from.
   - `TEXERA_PROBE_TIMEOUT` (default 2, used by `check-services.sh`) — 
per-probe curl timeout.
   
   LakeFS-ready timeout in `start_lakefs` is currently hard-coded at 60 s; 
making it env-configurable is a small follow-up.
   
   ## Test plan
   
   Verified locally (macOS, bash 3.2):
   
   - [x] `texera setup` from a clean checkout, then `texera start` → menu 
→ mode 1 → all 9 services come up → bar flips green → frontend loads at `:4200`.
   - [x] Ctrl+C while running → bar disappears, scroll region restored, JVMs 
all exit within a couple seconds.
   - [x] Immediate `texera start full` again → no port conflicts (the previous 
orphan-JVM regression is gone).
   - [x] `texera stop` from a separate shell while a `start` is running → both 
terminals come back clean.
   - [x] Kill `file-service` mid-run via `pkill -f FileService` → bar flips to 
`✗ 1/9 DOWN: file✗ (… elapsed)` within 2 s.
   - [x] `texera start infra` → only Postgres + LakeFS/MinIO come up; script 
exits cleanly without blocking on `wait`.
   - [x] `texera status` from a second terminal during a healthy run → all up.
   - [x] `texera logs file` → tails `logs/texera-dev/file.log`.
   - [x] PGID/group kill round-trip verified with a synthetic `sleep | cat | 
awk` pipeline (`ps -o pid,pgid,comm -g <pgid>` empty after one TERM).
   - [x] LakeFS probe endpoints (`/_health`, `/`) both verified to return 200.
   - [x] All Java mainclass names verified against built jar manifests and 
amber launcher scripts.
   
   Not yet verified (follow-ups, see below):
   
   - Cross-terminal: only tested in macOS Terminal.app + tmux. Behavior in 
iTerm2, VS Code's embedded terminal, IntelliJ console, screen, etc. unverified.
   - Terminal resize during a run (SIGWINCH).
   - Headless / `texera start backend | tee` path through CI.
   
   ## Known limitations
   
   - **TTY only for the live bar.** Non-TTY runs fall back to a one-shot 
banner. This is intentional but means `texera start full | tee session.log` 
won't show the live view.
   - **Concurrent log writes can occasionally corrupt one bar frame.** 
Single-printf renders mitigate but don't fully eliminate byte-level 
interleaving on shared stdout. Self-heals on the next refresh.
   - **Mainclass pkill in `stop` is broad.** If you have another checkout of 
this repo running, `texera stop` here will kill that one too. `pkill -u "$USER"` 
would at least confine the match to your own processes on a shared machine, but 
a second checkout under the same user would still be hit; left as-is for now 
since most devs run a single instance.
   - **`set -m` semantics vary slightly across bash versions.** Verified on 
macOS bash 3.2 and Linux bash 5.x; unusual non-POSIX shells aren't supported 
(and the shebang is `#!/usr/bin/env bash` anyway).
   
   ## Follow-ups
   
   Tracking these separately, not blocking this PR:
   
   1. `bin/README.md` section documenting subcommands, modes, env vars, and the 
status bar.
   2. Make the LakeFS readiness timeout env-configurable 
(`TEXERA_LAKEFS_TIMEOUT`).
   3. `pkill -u "$USER"` on the orphan-mainclass fallback.
   4. CI smoke job: `texera start backend` headless, assert exit code on 
readiness.
   5. `AGENTS.md` mention so subagents prefer `texera start` over the 
`bin/*-service.sh` set when bringing the stack up.
   
   ## Migration notes
   
   For existing contributors: nothing breaks. The old `bin/*-service.sh` 
scripts, IntelliJ `.run/*.xml` configs, and `bin/single-node` deploy are 
untouched and continue to work. `texera start` is opt-in.
   
   The first time you use it: `texera setup` once, then `texera start`. If 
you've ever Ctrl+C'd one of the old scripts and left an orphan JVM, run `texera 
stop` first — its mainclass fallback will clean those up.
   

