damccorm commented on code in PR #23552:
URL: https://github.com/apache/beam/pull/23552#discussion_r991304575
########## website/www/site/content/en/documentation/ml/data-processing.md: ##########
@@ -0,0 +1,117 @@

---
title: "Overview"
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Data processing

There are several types of data processing in Beam that are applicable to any AI/ML project:
- Data exploration: getting to know your data (properties, distributions, statistics) at the start of your project's development, or when your data changes significantly.
- Data preprocessing: transforming your data so that it is ready to be used to train your model.
- Data post-processing: transforming the output of your model after inference so that it is meaningful.
- Data validation: checking the quality of your data, such as detecting outliers and reporting on standard deviations and class distributions.

These tasks fall into two main topics: we first look at data exploration, and then at data pipelines in ML, which cover both data preprocessing and validation. Data post-processing is not discussed separately, because it is in essence the same as preprocessing and differs only in the order and type of the pipeline.

## Initial data exploration

A popular tool for data exploration is [Pandas](https://pandas.pydata.org/), a data analysis and manipulation tool for Python. It uses DataFrames, a data structure that contains two-dimensional tabular data with labeled rows and columns. The Apache Beam Python SDK provides a [DataFrame API](https://beam.apache.org/documentation/dsls/dataframes/overview/) for working with Pandas-like DataFrame objects.

The Beam DataFrame API is intended to provide access to a familiar programming interface within a Beam pipeline. It allows you to easily perform data exploration and later reuse the same code for your data preprocessing pipeline. This way, you can build complex data processing pipelines by invoking only standard Pandas commands.

You can use the DataFrame API in combination with the [Beam interactive runner](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/README.md) in a [JupyterLab notebook](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development). This lets you iteratively develop pipelines and display the results of your individual pipeline steps.

Here is an example of data exploration in Beam in a notebook:

```
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

p = beam.Pipeline(InteractiveRunner())
beam_df = p | beam.dataframe.io.read_csv(input_path)

# Investigate columns and data types
beam_df.dtypes

# Generate descriptive statistics
ib.collect(beam_df.describe())

# Investigate missing values
ib.collect(beam_df.isnull())
```
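Worth noting: `ib.collect` computes the deferred result and returns it as a plain `pandas.DataFrame`, so standard Pandas calls work on the collected output. A minimal sketch, continuing the session above and assuming the CSV behind `input_path` contains a numeric `age` column (a hypothetical name, not from the example):

```
# ib.collect runs the needed pipeline fragment and returns a pandas.DataFrame.
stats = ib.collect(beam_df.describe())

# From here on, ordinary Pandas applies; 'age' is an assumed column name.
print(stats['age'])
```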
## Data pipeline for ML

A typical data preprocessing pipeline consists of the following steps:
1. Reading and writing data: reading and writing data from your file system, database, or messaging queue. Beam has a rich set of [IO connectors](https://beam.apache.org/documentation/io/built-in/) for ingesting and writing data.
2. Data cleaning: you typically want to filter and clean your data before using it for your ML model, for example by removing duplicate or irrelevant data, correcting mistakes in your dataset, filtering out unwanted outliers, and handling missing data.
3. Data transformations: your data needs to fit the input that your model expects for training. Examples are normalization, one-hot encoding, scaling, and vectorizing.
4. Data enrichment: you will often want to enrich your data with external data sources to make it more meaningful or easier for an ML model to interpret, for example by transforming a city name or address into a coordinate.
5. Data validation and metrics: you also want to make sure your data adheres to a specific set of requirements that can be validated in your pipeline, and you might want to report metrics from your data, such as class distributions.

All of these steps can be implemented in a Beam pipeline. Here is a minimal example of such a pipeline that demonstrates all of the steps mentioned above:

```
import apache_beam as beam
from apache_beam.metrics import Metrics

with beam.Pipeline() as pipeline:
  # Create data
  input_data = (
      pipeline
      | beam.Create([
          {'age': 25, 'height': 176, 'weight': 60, 'city': 'London'},
          {'age': 61, 'height': 192, 'weight': 95, 'city': 'Brussels'},
          {'age': 48, 'height': 163, 'weight': None, 'city': 'Berlin'}]))

  # Clean data
  def filter_missing_data(row):
    if row['weight']:
      yield row

  cleaned_data = input_data | beam.FlatMap(filter_missing_data)
```

Review Comment:

```suggestion
  def filter_missing_data(row):
    return row['weight'] is not None

  cleaned_data = input_data | beam.Filter(filter_missing_data)
```

A (slightly) cleaner way of doing this
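For context on the suggestion: `beam.Filter` keeps exactly the elements for which the predicate returns `True`, which states the intent more directly than a `FlatMap` that conditionally yields. There is also a subtle behavior difference: `row['weight'] is not None` keeps rows with a weight of `0`, which the original truthiness check would silently drop. A small self-contained sketch of the `Filter` version, with made-up sample data:

```
import apache_beam as beam

rows = [
    {'weight': 60},
    {'weight': None},  # dropped: weight is missing
    {'weight': 0},     # kept by 'is not None'; a truthiness check would drop it
]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(rows)
        | beam.Filter(lambda row: row['weight'] is not None)
        | beam.Map(print))  # prints the rows that pass the filter
```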
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]