bobbai00 opened a new pull request, #4675: URL: https://github.com/apache/texera/pull/4675
### What changes were proposed in this PR?

> **Stacked on top of #4668.** This PR's diff against `main` will reduce to a single commit (the auto-generation work) once #4668 is merged. Until then, this PR shows all of #4668's commits plus the auto-generation commit.

Replaces the hand-curated per-module `NOTICE-binary` files introduced in #4668 with output from a new generator that extracts attribution from each module's bundled jars.

**New script**: `bin/licensing/generate_notice_binary.py`

- Walks each module's `lib/` dir, opens every `*.jar` (skips `org.apache.texera.*`), and extracts every `META-INF/NOTICE` (or root-level `NOTICE`) file.
- Dedupes by SHA-1 of normalized content; jars sharing a NOTICE collapse into one block.
- Each block: `--- 80-dash sep ---`, project heading derived from a hand-curated `PROJECT_NAMES` table (longest-prefix match, e.g. `org.apache.hadoop.` → `Apache Hadoop`), sep, "Bundled jars" listing, verbatim upstream NOTICE.
- Sorted by jar count, descending; hash tiebreaker for stable order.
- Normalizes CRLF→LF so committed and regenerated outputs match byte-for-byte through git.
- Optional `--extras <file>` appends a verbatim block (used for non-jar attributions like aiohttp + Matplotlib).

**`amber/NOTICE-binary-extras`** (new): the aiohttp + Matplotlib blocks, since those are Python wheels, not jars.

**6 per-module `NOTICE-binary` files regenerated**, replacing the curated subsets. Block counts: 24 / 24 / 87 / 92 / 88 / 91 (was 18 / 18 / 25 / 26 / 26 / 27 in #4668). The counts are higher because dedup is by exact content rather than by hand-grouped upstream project, so e.g. Hadoop sub-artifacts whose `META-INF/NOTICE` differ slightly across versions now show as separate blocks. Every distinct attribution actually shipped is preserved verbatim, which is strictly more ASF-compliant under Apache-2.0 §4(d).
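The extract, dedupe, and ordering steps described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `generate_notice_binary.py`: the `PROJECT_NAMES` entries, the helper names (`normalize`, `project_name`, `collect_blocks`, `render`), the exact NOTICE member names matched, the trimming of surrounding whitespace, and the fallback heading for unmatched jars are all assumptions made for the example.

```python
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path

# Hypothetical excerpt of the prefix -> heading table; the real entries live in the script.
PROJECT_NAMES = {
    "org.apache.hadoop.": "Apache Hadoop",
    "org.apache.": "Apache (other)",
}
NOTICE_MEMBERS = ("META-INF/NOTICE", "NOTICE")  # assumed exact member names
SKIP_PREFIX = "org.apache.texera."              # first-party jars are skipped
SEP = "-" * 80


def normalize(text: str) -> str:
    # CRLF -> LF as the PR describes; trimming outer whitespace is an assumption.
    return text.replace("\r\n", "\n").strip()


def project_name(jar_name: str) -> str:
    # Longest-prefix match: the most specific matching prefix wins.
    # Falling back to the jar name itself is an assumption for this sketch.
    matches = [p for p in PROJECT_NAMES if jar_name.startswith(p)]
    return PROJECT_NAMES[max(matches, key=len)] if matches else jar_name


def collect_blocks(lib_dir: Path):
    """Return [(sha1, notice_text, [jar names])], most-shared NOTICE first."""
    texts = {}
    jars_by_hash = defaultdict(list)
    for jar in sorted(lib_dir.glob("*.jar")):
        if jar.name.startswith(SKIP_PREFIX):
            continue
        with zipfile.ZipFile(jar) as zf:
            names = set(zf.namelist())
            for member in NOTICE_MEMBERS:
                if member not in names:
                    continue
                text = normalize(zf.read(member).decode("utf-8", errors="replace"))
                digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
                texts[digest] = text
                if jar.name not in jars_by_hash[digest]:
                    jars_by_hash[digest].append(jar.name)
    # Jar count descending, then hash ascending as a deterministic tiebreaker.
    order = sorted(jars_by_hash, key=lambda d: (-len(jars_by_hash[d]), d))
    return [(d, texts[d], jars_by_hash[d]) for d in order]


def render(blocks) -> str:
    out = []
    for _digest, text, jars in blocks:
        out += [SEP, project_name(jars[0]), SEP, "Bundled jars:"]
        out += [f"  - {j}" for j in jars]
        out += ["", text, ""]
    return "\n".join(out)
```

Because the hash is taken over the normalized text, two jars whose NOTICE files differ only in line endings collapse into one block, while any real difference in wording produces a separate block. That is consistent with the higher block counts reported above.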
**CI verification**: new step in `build.yml`'s scala job, after the existing dist-unzip + license check:

```
for each module:
  regenerate NOTICE-binary against /tmp/dists/<module>-*/lib, diff against committed
  fail with a one-line fix-up command if drift
```

So future dep bumps: bump in `build.sbt` → CI fails on NOTICE drift → run `./bin/licensing/generate_notice_binary.py <module>/NOTICE-binary <lib-dir> [--extras …]` → commit.

### Any related issues, documentation, discussions?

Closes #4674

Depends on #4668 (this PR's base will retarget to a clean diff once #4668 lands)

ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d))

### How was this PR tested?

- Generator run locally against jars extracted from `ghcr.io/apache/texera-*:61ce334cb` images for all 6 modules; output verified line-by-line against current curated NOTICE blocks.
- CRLF→LF normalization verified: regenerated files produce byte-identical output to committed files (no spurious git auto-conversion drift).
- CI step's logic exercised locally: `generate_notice_binary.py /tmp/foo <lib-dir> --extras …` then `diff <module>/NOTICE-binary /tmp/foo` → empty (clean).
- Generator skips `org.apache.texera.*` jars (own first-party content, not third-party).

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)
