[jira] [Work logged] (BEAM-11322) Apache Beam Template to tokenize sensitive data

ASF GitHub Bot (Jira) Tue, 16 Mar 2021 01:49:04 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-11322?focusedWorklogId=566783&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-566783
 ]


ASF GitHub Bot logged work on BEAM-11322:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Mar/21 08:48
            Start Date: 16/Mar/21 08:48
    Worklog Time Spent: 10m 
      Work Description: KhaninArtur commented on a change in pull request #13995:
URL: https://github.com/apache/beam/pull/13995#discussion_r594966044



##########
File path: examples/java/src/main/java/org/apache/beam/examples/complete/datatokenization/README.md
##########
@@ -0,0 +1,169 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+# Apache Beam pipeline example to tokenize data using a remote RPC server
+
+This directory contains an Apache Beam example that creates a pipeline to read data from one of
+the supported sources, tokenize data with external API calls to a remote RPC server, and write data into one of the supported sinks.
+
+Supported data formats:
+
+- JSON
+- CSV
+
+Supported input sources:
+
+- File system
+- [Google Pub/Sub](https://cloud.google.com/pubsub)
+
+Supported destination sinks:
+
+- File system
+- [Google Cloud BigQuery](https://cloud.google.com/bigquery)
+- [Cloud BigTable](https://cloud.google.com/bigtable)
+
+Supported data schema format:
+
+- JSON with an array of fields described in BigQuery format
+
+In the main scenario, the template will create an Apache Beam pipeline that will read data in CSV or
+JSON format from a specified input source, send the data to an external processing server, receive
+processed data, and write it into a specified output sink.
+
+## Requirements
+
+- Java 8
+- One of the supported sources to read data from
+- One of the supported destination sinks to write data to
+- A configured RPC server to tokenize data
+
+## Getting Started
+
+This section describes what is needed to get the template up and running.
+
+- Gradle preparation
+- Local execution
+- Running as a Dataflow Template
+    - Setting Up Project Environment
+    - Build Data Tokenization Dataflow Flex Template
+    - Creating the Dataflow Flex Template
+    - Executing Template
+
+## Gradle preparation
+
+To run this example, your `build.gradle` file should contain the following task to execute the pipeline:
+
+```
+task execute (type:JavaExec) {
+    main = System.getProperty("mainClass")
+    classpath = sourceSets.main.runtimeClasspath
+    systemProperties System.getProperties()
+    args System.getProperty("exec.args", "").split()
+}
+```
+
+This task lets you run the pipeline with the following command:
+
+```bash
+gradle clean execute -DmainClass=org.apache.beam.examples.complete.datatokenization.DataTokenization \
+     -Dexec.args="--<argument>=<value> --<argument>=<value>"
+```
+
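
The `execute` task in the quoted README passes `exec.args` through Groovy's no-argument `split()`, which breaks the string on whitespace before the tokens reach the pipeline's `main` method. The Java sketch below is not part of the pull request; it only approximates that splitting to show the consequence, namely that a value containing a space ends up as two separate arguments. The class name, the helper methods, and the argument names `inputPath` and `outputPath` are hypothetical, and the `--name=value` parsing is a simplified stand-in for what Beam's own option parsing does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExecArgsSketch {

    // Approximates Groovy's no-argument String.split(): tokens are separated
    // by runs of whitespace, and an empty or blank string yields no tokens.
    public static String[] splitExecArgs(String execArgs) {
        String trimmed = execArgs.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Parses "--name=value" tokens into an ordered map. Illustration only:
    // the real pipeline relies on Beam's option parsing, not on this helper.
    public static Map<String, String> toOptionMap(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            int eq = arg.indexOf('=');
            // Assumption for this sketch: a bare "--flag" is treated as "true".
            String name = eq < 0 ? arg.substring(2) : arg.substring(2, eq);
            String value = eq < 0 ? "true" : arg.substring(eq + 1);
            options.put(name, value);
        }
        return options;
    }

    public static void main(String[] unused) {
        // Hypothetical argument names, not taken from the README.
        String execArgs = "--inputPath=/tmp/in.csv --outputPath=/tmp/out";
        System.out.println(toOptionMap(splitExecArgs(execArgs)));
    }
}
```

Because the split is purely whitespace-based, something like `-Dexec.args="--inputPath=/tmp/my file.csv"` would arrive as the two tokens `--inputPath=/tmp/my` and `file.csv`, so values passed this way should avoid spaces.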

Review comment:
       Yes, we plan to spread the word and blog about it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 566783)
    Time Spent: 1h  (was: 50m)

> Apache Beam Template to tokenize sensitive data
> -----------------------------------------------
>
>                 Key: BEAM-11322
>                 URL: https://issues.apache.org/jira/browse/BEAM-11322
>             Project: Beam
>          Issue Type: Improvement
>          Components: examples-java
>            Reporter: Artur Khanin
>            Assignee: Artur Khanin
>            Priority: P3
>              Labels: Clarified
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Some users may want to protect their sensitive data using tokenization.
> We propose to create a Beam example template that will demonstrate Beam 
> transform to protect sensitive data using tokenization. In our example, we 
> will use an external service for the data tokenization. 
> At a high level, a pipeline that will: 
>  * support batch (GCS) and streaming (Pub/Sub) input sources
>  * tokenize sensitive data via external REST service - we are about to use 
> Protegrity
>  * output tokenized data into BigQuery or BigTable



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
