Hi,

Just out of pure curiosity, how did this use case come about? Do users run third-party queries in Wayang against their data sources without taking care?
It reads like most of the compliance problems this proposal describes can be avoided by properly configuring the system and the code that is about to be run. It would be interesting to hear from you about where this comes from.

Best,
Juri

________________________________
From: Alexander Alten <a...@scalytics.io>
Sent: 25 March 2025 11:55
To: dev@wayang.apache.org <dev@wayang.apache.org>
Subject: Re: Feature Proposal: Outbound Compliance Control for Data Sinks in Apache Wayang

Hi,

First, I think that’s a good proposal, thank you!

Not sure tbh if something like that falls under the scope of Wayang or under the underlying data protection policies defined by the user of Wayang. On the other hand, some kind of reporting API to report intentional or unintentional data drainage would be a great addition. Wayang, as an in-situ data processing framework, needs to have such an extension, but that’s just my point of view. Such an API could be a great point for other open-source projects to dock into Wayang and provide the necessary policy management interfaces.

Best,
Alex

> On Mar 25, 2025, at 10:33, Mirko Kämpf <mirko.kae...@gmail.com> wrote:
>
> Hey Wayang community,
>
> We’d like to open a discussion on the topic of *compliance-aware execution*
> in Apache Wayang, particularly addressing the concern of *preventing
> unintended outbound data transfers* when executing jobs across hybrid or
> multi-tenant clusters.
>
>
> *1. Title*
>
> *Outbound Traffic Validation & Compliance Optimizer Extension*
>
>
> *2. Motivation*
>
> The question came up from users:
>
> *“How can we guarantee that Apache Wayang jobs do not export data outside
> the secured boundary of the cluster?”*
>
> Today, this guarantee is mostly achieved externally, through firewalls and
> infrastructure-level fencing. However, for stricter compliance
> environments, *enforcing such guarantees at the logical plan level* would
> significantly increase trust and transparency in Wayang-based data
> workflows.
>
>
> *3. Proposal Summary*
>
> We propose introducing an *Outbound Compliance Mode* to Wayang’s query
> optimizer. In this mode, all sink operations would be validated against a
> configurable set of *allowed target clusters or zones*. The validation
> could occur during optimization and/or execution plan generation, and could
> log or reject non-compliant plans.
>
> This mechanism would ensure that Wayang jobs cannot accidentally or
> intentionally route data to non-approved sinks.
>
>
> *4. Detailed Description*
>
> This feature consists of two layers:
>
> *Level 1: Infrastructure Fencing (external)*
>
> Outbound traffic is blocked by firewalls or network policies. This is
> already widely used and provides basic protection.
>
> *Level 2: Active Flow Control (in Wayang)*
>
> An extension to the query optimizer could validate all *sink operators*
> against a whitelist of approved destinations, possibly defined in
> configuration or via rule sets (e.g., allowlist of URIs, target types, or
> data zones).
>
>
> We envision:
>
> • A *Compliance Query Optimizer Extension* (activated optionally)
>
> • Declarative rules to validate:
>
>    • Sink destination type (e.g., JDBC, S3, HDFS)
>
>    • Target host or region (e.g., EU-only)
>
>    • Sink configuration (e.g., encryption on/off)
>
> • Rejection or logging of plans that violate compliance rules
>
>
> Optionally, this could be extended to support:
>
> • *CORS-like logic for data sinks*, where the sink declares allowed inbound
> data zones
>
> • *Smart contract-based approvals* for external writes, with enforced
> logging or audit trails
>
>
> This would provide enterprise-grade compliance guarantees *at the planning
> layer*, beyond what firewalls alone can enforce.
>
>
> *5. Alternatives Considered*
>
> The standard approach today is *relying on network firewalls* and
> infrastructure-level policies. However, these do not provide visibility or
> explainability inside the Wayang job planning phase.
>
> Another approach is *static code analysis*, but this would be outside
> Wayang and harder to maintain.
>
> *6. Next Steps / Call for Feedback*
>
> We’re happy to draft a design proposal or implement a prototype if this
> direction is of interest to the community.
>
> We’d especially welcome input on:
>
> • Where in the optimizer pipeline this logic should live
>
> • Whether this aligns with existing security/privacy goals
>
> • Integration with metadata or provenance tracking
>
>
> Looking forward to your thoughts!
>
>
> Best regards,
>
> Mirko
>
> --
> Dr. Mirko Kämpf
> *Founder & Coach*
> *maindset.ACADEMY*
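
To make the discussion a bit more concrete, the "Level 2" check from section 4 of the proposal might look roughly like the Python sketch below. Nothing in it is Wayang API: `SinkSpec`, `ComplianceRules`, `violations` and `validate_plan` are hypothetical names, and the hosts and regions are made up. It only illustrates the idea of validating each sink's destination type, host, region and encryption setting against an allowlist, then either rejecting the plan or reporting the violations.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SinkSpec:
    """Hypothetical description of one sink operator in a plan (not a Wayang class)."""
    uri: str          # e.g. "hdfs://namenode.internal.example:8020/out"
    region: str       # data zone the target lives in, e.g. "eu"
    encrypted: bool   # whether the sink is configured to encrypt its output


@dataclass(frozen=True)
class ComplianceRules:
    """Hypothetical declarative rule set, e.g. loaded from configuration."""
    allowed_schemes: frozenset
    allowed_hosts: frozenset
    allowed_regions: frozenset
    require_encryption: bool = True


def violations(sink, rules):
    """Return human-readable rule violations for a single sink (empty if compliant)."""
    found = []
    parsed = urlparse(sink.uri)
    if parsed.scheme not in rules.allowed_schemes:
        found.append(f"sink type '{parsed.scheme}' is not allowed")
    if parsed.hostname not in rules.allowed_hosts:
        found.append(f"target host '{parsed.hostname}' is not allowed")
    if sink.region not in rules.allowed_regions:
        found.append(f"region '{sink.region}' is not allowed")
    if rules.require_encryption and not sink.encrypted:
        found.append("encryption is required but disabled")
    return found


def validate_plan(sinks, rules, reject=True):
    """Check every sink; raise on violations if reject=True, else return a report."""
    report = {}
    for sink in sinks:
        problems = violations(sink, rules)
        if problems:
            report[sink.uri] = problems
    if report and reject:
        raise PermissionError(f"non-compliant sinks: {report}")
    return report


if __name__ == "__main__":
    rules = ComplianceRules(
        allowed_schemes=frozenset({"s3", "hdfs"}),
        allowed_hosts=frozenset({"namenode.internal.example"}),
        allowed_regions=frozenset({"eu"}),
    )
    sinks = [
        SinkSpec("hdfs://namenode.internal.example:8020/out", "eu", True),
        SinkSpec("s3://partner-bucket/export", "us", False),
    ]
    # Logging mode: collect violations instead of rejecting the plan.
    for uri, problems in validate_plan(sinks, rules, reject=False).items():
        print(uri, "->", "; ".join(problems))
```

With `reject=True` the same call would raise instead of returning a report, which corresponds to the "rejection or logging" choice in the proposal. Where such a check would hook into the optimizer is exactly the open question from section 6, so the sketch deliberately leaves that out.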