Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

Sem Wed, 12 Feb 2025 11:56:28 -0800

DuckDB provides "PySpark syntax" on top of fast single node engine:


https://duckdb.org/docs/clients/python/spark_api.html

As I remember, DuckDB is much faster than pandas on a single node and italready provides a spark-compatible API.


On 2/10/25 1:02 PM, José Müller wrote:


Hi all,

I'm new to the Spark community—please let me know if this isn’t theright forum for feature proposals.


*About Me:*

I’ve spent over 10 years in data roles, from engineering to machinelearning and analytics. A recurring challenge I've encountered is thedisconnect between data engineering and ML teams when it comes tofeature engineering and data transformations using Spark.


*The Problem:*

There is a growing disconnect between Data Engineering and MachineLearning teams due to differences in the tools and languages they usefor data transformations.


 *

    *Tooling Divergence:*

      o *Data Engineering Teams* primarily use *Spark SQL* and
        *PySpark* for building scalable data pipelines and
        transforming data across different layers (bronze, silver, gold).
      o *Machine Learning Teams* typically rely on *Pandas* for
        feature engineering, model training, and recreating features
        on-the-fly in production environments.
 *

    *Challenges Arising from This Divergence:*

    1.

        *Code Duplication & Maintenance Overhead:*

          o ML teams often need to rebuild feature pipelines from raw
            data or adapt data engineer outputs into Pandas, leading
            to duplicated effort and maintenance in two different
            codebases.
          o This forces teams to maintain expertise in multiple
            frameworks, increasing complexity and reducing cross-team
            supportability.
    2.

        *Inconsistent Feature Definitions:*

          o Transformations are implemented in different languages
            (PySpark for DE, Pandas for ML), which causes *feature
            drift*—where the same metrics or features calculated in
            separate environments diverge over time, leading to
            inconsistent results.
    3.

        *Performance Bottlenecks in Production:*

          o While PySpark can regenerate features for serving models
            via APIs, it introduces inefficiencies:
              + *Overhead for Small Payloads:* Spark's distributed
                processing is unnecessary for small, real-time
                requests and is slower compared to Pandas.
              + *Latency Issues:* Spark struggles to deliver
                millisecond-level responses required in production APIs.
              + *Operational Complexity:* Maintaining Spark within API
                environments adds unnecessary overhead.
    4.

        *Data Refinement Gap:*

          o Since ML teams are often formed after data engineering
            teams, they lag behind in leveraging years of
            PySpark-based data refinements. Reproducing the same
            transformations in Pandas to match the DE pipelines is
            time-consuming and error-prone.

*
*
*The Ideal Scenario*

To bridge the gap between Data Engineering and Machine Learning teams,the ideal solution would:


1.

    *Unify the Tech Stack:*

      * Both Data Engineers and ML teams would use *PySpark
        syntax* for data transformations and feature engineering. This
        shared language would simplify code maintenance, improve
        collaboration, and reduce the need for cross-training in
        multiple frameworks.
2.

    *Flexible Execution Backend:*

      * While using the same PySpark code, teams could choose the most
        efficient execution engine based on their needs:
          o *Data Engineers* would continue leveraging Spark's
            distributed processing for large-scale data transformations.
          o *ML Teams* could run the same PySpark transformations
            using *Pandas* as the processing engine for faster,
            on-the-fly feature generation in model training and API
            serving.

This unified approach would eliminate redundant codebases, ensureconsistent feature definitions, and optimize performance across bothbatch and real-time workflows.



*The Proposal:*

*Introduce an API that allows PySpark syntax while processingDataFrame using either Spark or Pandas depending on the session context.*


*
*

*Simple, but intuitive example:*

import pyspark.sql.functionsas F

def silver(bronze_df):
     return (
         bronze_df
         .withColumnRenamed("bronze_col","silver_col")
     )

def gold(silver_df):
     return (
         silver_df
         .withColumnRenamed("silver_col","gold_col")
         .withColumn("gold_col", F.col("gold_col") +1)
     )

def features(gold_df):
     return (
         gold_df
         .withColumnRenamed("gold_col","feature_col")
         .withColumn("feature_col", F.col("feature_col") +1)
     )

# With the Spark Session (normal way of using PySpark) spark = 
SparkSession.builder.master("local[1]").getOrCreate()
bronze_df = spark.createDataFrame(schema=("bronze_col",),data=[(1,)])
silver_df = silver(bronze_df)
gold_df = gold(silver_df)
features_df = features(gold_df)
features_df.show()

# Proposed "Pandas Spark Session" spark = 
SparkSession.builder.as_pandas.getOrCreate()
# This would execute the same transformations using Pandas under the hood.

This would enable teams to share the same codebase while choosing themost efficient processing engine.

We've built and experimented in different data teams with a publiclibrary, *flyipe<https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,that uses |pandas_api()| transformations, but can run using Pandas<https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,but it still requires ML teams to manage separate pipelines for Sparkdependencies.

I’d love to hear thoughts from the community on this idea, and *ifthere's a better approach to solving this issue*.


Thanks,
José Müller


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

Reply via email to