liferoad commented on code in PR #34943:
URL: https://github.com/apache/beam/pull/34943#discussion_r2105034573
##########
website/www/site/content/en/case-studies/akvelon.md:
##########
@@ -17,3 +26,154 @@
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
+<div class="case-study-opinion">
+ <div class="case-study-opinion-img">
+ <img src="/images/logos/powered-by/akvelon.png"/>
+ </div>
+ <blockquote class="case-study-quote-block">
+ <p class="case-study-quote-text">
+ “To support data privacy and pipeline reusability at scale, Akvelon developed Beam-based solutions for Protegrity and a major North American credit reporting company, enabling tokenization with Dataflow Flex Templates. Akvelon also built a CDAP Connector to integrate CDAP plugins with Apache Beam, enabling plugin reuse and multi-runtime compatibility.”
+ </p>
+ <div class="case-study-quote-author">
+ <div class="case-study-quote-author-img">
+ <img src="/images/case-study/akvelon/pikle.png">
+ </div>
+ <div class="case-study-quote-author-info">
+ <div class="case-study-quote-author-name">
+ Ashley Pikle
+ </div>
+ <div class="case-study-quote-author-position">
+ Director of AI Business Development @Akvelon
+ </div>
+ </div>
+ </div>
+ </blockquote>
+</div>
+<div class="case-study-post">
+
+# Secure and Interoperable Apache Beam Pipelines by Akvelon
+
+## Background
+
+To meet growing enterprise needs for secure, scalable, and interoperable data processing pipelines, **Akvelon** developed multiple Apache Beam-powered solutions tailored for real-world production environments:
+- Data tokenization and detokenization capabilities for **Protegrity** and a leading North American credit reporting company
+- A connector layer to integrate **CDAP** plugins into Apache Beam pipelines
+
+By leveraging [Apache Beam](https://beam.apache.org/) and [Google Cloud Dataflow](https://cloud.google.com/products/dataflow?hl=en), Akvelon enabled its clients to achieve scalable data protection, regulatory compliance, and platform interoperability through reusable, open-source pipeline components.
+
+## Use Case 1: Data Tokenization for Protegrity and a Leading Credit Reporting Company
+
+### The Challenge
+
+**Protegrity**, a leading enterprise data-security vendor, sought to enhance its data protection platform with scalable tokenization support for batch and streaming data. Their goal: allow customers such as a major North American credit reporting company to tokenize sensitive data using Google Cloud Dataflow. The solution needed to be fast, secure, reusable, and compliant with privacy regulations (e.g., HIPAA, GDPR).
+
+### The Solution
+
+Akvelon designed and implemented a **Dataflow Flex Template** using Apache Beam that allows users to tokenize and detokenize sensitive data within both batch and streaming pipelines.
+
+<div class="post-scheme">
+ <a href="/images/case-study/akvelon/diagram-01.png" target="_blank" title="Click to enlarge">
+ <img src="/images/case-study/akvelon/diagram-01.png" alt="Protegrity & Equifax Tokenization Pipeline">
+ </a>
+</div>
+
+### Key features
+- **Seamless integration with Protegrity UDFs**, enabling native tokenization directly within Beam transforms without requiring external service orchestration
+- **Support for multiple data formats** such as CSV, JSON, Parquet, allowing flexible deployment across diverse data pipelines
+- **Stateful processing with `DoFn` and timers**, which improves streaming reliability and reduces overall pipeline latency
+- **Full compatibility with Google Cloud Dataflow**, ensuring autoscaling, fault tolerance, and operational simplicity through managed Apache Beam execution
+
+This design provided both Protegrity and its enterprise clients with a reusable, open-source architecture for scalable data privacy and processing.
+
+### The Results
+- Enabled data tokenization at scale for regulated industries
+- Accelerated adoption of Dataflow templates across Protegrity’s customer base
+- Delivered an open-source Flex Template that benefits the entire Apache Beam community
+
+<blockquote class="case-study-quote-block case-study-quote-wrapped">
+ <p class="case-study-quote-text">
+ In collaboration with Akvelon, Protegrity utilized a Dataflow Flex template that helps us enable customers to tokenize and detokenize streaming and batch data from a fully managed Google Cloud Dataflow service. We appreciate Akvelon’s support as a trusted partner with Google Cloud expertise.
+ </p>
+ <div class="case-study-quote-author">
+ <div class="case-study-quote-author-img">
+ <img src="/images/case-study/akvelon/chitnis.png">
+ </div>
+ <div class="case-study-quote-author-info">
+ <div class="case-study-quote-author-name">
+ Jay Chitnis
+ </div>
+ <div class="case-study-quote-author-position">
+ VP of Partners and Business Development @Protegrity
+ </div>
+ </div>
+ </div>
+</blockquote>
+
+## Use Case 2: CDAP Connector for Apache Beam
+
+### The Challenge
+
+**CDAP** had extensive plugin support for Spark but lacked native compatibility with Apache Beam. This limitation prevented organizations from reusing CDAP's rich ecosystem of data connectors (e.g., Salesforce, HubSpot, ServiceNow) within Beam-based pipelines, constraining cross-platform integration.
+
+### The Solution
+
+Akvelon engineered a **shim layer** (CDAP Connector) that bridges CDAP plugins with Apache Beam. This innovation enables CDAP source and sink plugins to operate seamlessly within Beam pipelines.
+
+<div class="post-scheme">
+ <a href="/images/case-study/akvelon/diagram-02.png" target="_blank" title="Click to enlarge">
+ <img src="/images/case-study/akvelon/diagram-02.png" alt="CDAP Connector Integration with Apache Beam">
+ </a>
+</div>
+
+### Highlights
+
+- Supports `StructuredRecord` format conversion to Beam schema (`BeamRow`)
+- Enables CDAP plugins to run seamlessly in both Spark and Beam pipelines
+- Facilitates integration testing across third-party data sources (e.g., Salesforce, Zendesk)
+- Complies with Beam’s development and style guide for open-source contributions
+
+The project included prototyping, test infrastructure, and Salesforce plugin pipelines to ensure robustness.
+
+### The Results
+
+- Made **CDAP plugins reusable in Beam pipelines**

Review Comment:
   For both use cases, the results sections would be much stronger with quantifiable metrics, if available (e.g., percentage increase, number of customers, or data volume).