Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

Sean Owen Mon, 10 Feb 2025 09:17:41 -0800

I don't think this makes sense, or lacks motivation. You want teams to
convert pandas code to Pyspark syntax, only to run it on pandas? why? just
run the pandas code in a larger job that also uses Spark if you like, or
within UDFs.


If you remove this assumption that people need to convert to Pyspark APIs,
then you're also left with pandas-on-Spark as an option. Or, just, pandas.

On Mon, Feb 10, 2025 at 6:02 AM José Müller <joseheliomul...@gmail.com>
wrote:

> Hi all,
>
> I'm new to the Spark community—please let me know if this isn’t the right
> forum for feature proposals.
>
> *About Me:*
> I’ve spent over 10 years in data roles, from engineering to machine
> learning and analytics. A recurring challenge I've encountered is the
> disconnect between data engineering and ML teams when it comes to feature
> engineering and data transformations using Spark.
>
> *The Problem:*
>
> There is a growing disconnect between Data Engineering and Machine
> Learning teams due to differences in the tools and languages they use for
> data transformations.
>
>    -
>
>    *Tooling Divergence:*
>    - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark* for
>       building scalable data pipelines and transforming data across different
>       layers (bronze, silver, gold).
>       - *Machine Learning Teams* typically rely on *Pandas* for feature
>       engineering, model training, and recreating features on-the-fly in
>       production environments.
>    -
>
>    *Challenges Arising from This Divergence:*
>    1.
>
>       *Code Duplication & Maintenance Overhead:*
>       - ML teams often need to rebuild feature pipelines from raw data or
>          adapt data engineer outputs into Pandas, leading to duplicated 
> effort and
>          maintenance in two different codebases.
>          - This forces teams to maintain expertise in multiple
>          frameworks, increasing complexity and reducing cross-team 
> supportability.
>       2.
>
>       *Inconsistent Feature Definitions:*
>       - Transformations are implemented in different languages (PySpark
>          for DE, Pandas for ML), which causes *feature drift*—where the
>          same metrics or features calculated in separate environments diverge 
> over
>          time, leading to inconsistent results.
>       3.
>
>       *Performance Bottlenecks in Production:*
>       - While PySpark can regenerate features for serving models via
>          APIs, it introduces inefficiencies:
>             - *Overhead for Small Payloads:* Spark's distributed
>             processing is unnecessary for small, real-time requests and is 
> slower
>             compared to Pandas.
>             - *Latency Issues:* Spark struggles to deliver
>             millisecond-level responses required in production APIs.
>             - *Operational Complexity:* Maintaining Spark within API
>             environments adds unnecessary overhead.
>          4.
>
>       *Data Refinement Gap:*
>       - Since ML teams are often formed after data engineering teams,
>          they lag behind in leveraging years of PySpark-based data 
> refinements.
>          Reproducing the same transformations in Pandas to match the DE 
> pipelines is
>          time-consuming and error-prone.
>
>
> *The Ideal Scenario*
>
> To bridge the gap between Data Engineering and Machine Learning teams, the
> ideal solution would:
>
>    1.
>
>    *Unify the Tech Stack:*
>    - Both Data Engineers and ML teams would use *PySpark syntax* for data
>       transformations and feature engineering. This shared language would
>       simplify code maintenance, improve collaboration, and reduce the need 
> for
>       cross-training in multiple frameworks.
>    2.
>
>    *Flexible Execution Backend:*
>    - While using the same PySpark code, teams could choose the most
>       efficient execution engine based on their needs:
>          - *Data Engineers* would continue leveraging Spark's distributed
>          processing for large-scale data transformations.
>          - *ML Teams* could run the same PySpark transformations using
>          *Pandas* as the processing engine for faster, on-the-fly feature
>          generation in model training and API serving.
>
> This unified approach would eliminate redundant codebases, ensure
> consistent feature definitions, and optimize performance across both batch
> and real-time workflows.
>
>
> *The Proposal:*
>
> *Introduce an API that allows PySpark syntax while processing DataFrame
> using either Spark or Pandas depending on the session context.*
>
>
> *Simple, but intuitive example:*
>
> import pyspark.sql.functions as F
>
> def silver(bronze_df):
>     return (
>         bronze_df
>         .withColumnRenamed("bronze_col", "silver_col")
>     )
>
> def gold(silver_df):
>     return (
>         silver_df
>         .withColumnRenamed("silver_col", "gold_col")
>         .withColumn("gold_col", F.col("gold_col") + 1)
>     )
>
> def features(gold_df):
>     return (
>         gold_df
>         .withColumnRenamed("gold_col", "feature_col")
>         .withColumn("feature_col", F.col("feature_col") + 1)
>     )
>
> # With the Spark Session (normal way of using PySpark)
> spark = SparkSession.builder.master("local[1]").getOrCreate()
> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
> silver_df = silver(bronze_df)
> gold_df = gold(silver_df)
> features_df = features(gold_df)
> features_df.show()
>
> # Proposed "Pandas Spark Session"
> spark = SparkSession.builder.as_pandas.getOrCreate()
> # This would execute the same transformations using Pandas under the hood.
>
> This would enable teams to share the same codebase while choosing the most
> efficient processing engine.
>
> We've built and experimented in different data teams with a public
> library, *flyipe
> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
> that uses pandas_api() transformations, but can run using Pandas
> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
> but it still requires ML teams to manage separate pipelines for Spark
> dependencies.
>
> I’d love to hear thoughts from the community on this idea, and *if
> there's a better approach to solving this issue*.
>
> Thanks,
> José Müller
>

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

Reply via email to