There's also PolarSpark, which offers a PySpark API on top of the Polars engine:

https://github.com/khalidmammadov/polarspark

PS: It's also on PyPI.
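
A minimal sketch, assuming (as the repo README suggests) that polarspark
mirrors the pyspark.sql namespace as a drop-in replacement; exact import
paths may differ between releases:

# Hypothetical drop-in swap: PySpark-style code, Polars engine underneath.
from polarspark.sql import SparkSession
from polarspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(data=[(1,)], schema=["bronze_col"])
df.withColumn("silver_col", F.col("bronze_col") + 1).show()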

On Wed, 12 Feb 2025, 19:56 Sem, <ssinche...@apache.org> wrote:

> DuckDB provides "PySpark syntax" on top of a fast single-node engine:
>
> https://duckdb.org/docs/clients/python/spark_api.html
>
> As I remember, DuckDB is much faster than pandas on a single node, and
> it already provides a Spark-compatible API.
>
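> For reference, a small example in the spirit of the linked docs; note
> that DuckDB's Spark API lives under duckdb.experimental.spark and is
> still marked experimental, so the surface may change:
>
> from duckdb.experimental.spark.sql import SparkSession
> from duckdb.experimental.spark.sql.functions import col, lit
> import pandas as pd
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(pd.DataFrame({"age": [34, 45]}))
> rows = df.withColumn("city", lit("Seattle")).select(col("age"), col("city")).collect()
> print(rows)
>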
> On 2/10/25 1:02 PM, José Müller wrote:
> >
> > Hi all,
> >
> > I'm new to the Spark community—please let me know if this isn’t the
> > right forum for feature proposals.
> >
> > *About Me:*
> > I’ve spent over 10 years in data roles, from engineering to machine
> > learning and analytics. A recurring challenge I've encountered is the
> > disconnect between data engineering and ML teams when it comes to
> > feature engineering and data transformations using Spark.
> >
> > *The Problem:*
> >
> > There is a growing disconnect between Data Engineering and Machine
> > Learning teams due to differences in the tools and languages they use
> > for data transformations.
> >
> > *Tooling Divergence:*
> >
> >   o *Data Engineering Teams* primarily use *Spark SQL* and *PySpark*
> >     for building scalable data pipelines and transforming data across
> >     the different layers (bronze, silver, gold).
> >   o *Machine Learning Teams* typically rely on *Pandas* for feature
> >     engineering, model training, and recreating features on-the-fly
> >     in production environments.
> >
> > *Challenges Arising from This Divergence:*
> >
> >   1. *Code Duplication & Maintenance Overhead:* ML teams often need
> >      to rebuild feature pipelines from raw data or adapt the data
> >      engineering outputs into Pandas, leading to duplicated effort
> >      and maintenance across two codebases. This forces teams to
> >      maintain expertise in multiple frameworks, increasing complexity
> >      and reducing cross-team supportability.
> >   2. *Inconsistent Feature Definitions:* Because transformations are
> >      implemented in different frameworks (PySpark for DE, Pandas for
> >      ML), the same metrics or features calculated in separate
> >      environments diverge over time (*feature drift*), leading to
> >      inconsistent results.
> >   3. *Performance Bottlenecks in Production:* While PySpark can
> >      regenerate features for serving models via APIs, it introduces
> >      inefficiencies:
> >        + *Overhead for Small Payloads:* Spark's distributed
> >          processing is unnecessary for small, real-time requests and
> >          is slower than Pandas.
> >        + *Latency Issues:* Spark struggles to deliver the
> >          millisecond-level responses required by production APIs.
> >        + *Operational Complexity:* Maintaining Spark within API
> >          environments adds unnecessary overhead.
> >   4. *Data Refinement Gap:* Since ML teams are often formed after
> >      data engineering teams, they lag behind in leveraging years of
> >      PySpark-based data refinements, and reproducing the same
> >      transformations in Pandas to match the DE pipelines is
> >      time-consuming and error-prone.
> >
> > *The Ideal Scenario*
> >
> > To bridge the gap between Data Engineering and Machine Learning teams,
> > the ideal solution would:
> >
> > 1. *Unify the Tech Stack:* Both Data Engineers and ML teams would
> >    use *PySpark syntax* for data transformations and feature
> >    engineering. This shared language would simplify code
> >    maintenance, improve collaboration, and reduce the need for
> >    cross-training in multiple frameworks.
> >
> > 2. *Flexible Execution Backend:* While using the same PySpark code,
> >    teams could choose the most efficient execution engine for their
> >    needs (see the sketch after the next paragraph):
> >      o *Data Engineers* would continue leveraging Spark's
> >        distributed processing for large-scale data transformations.
> >      o *ML Teams* could run the same PySpark transformations using
> >        *Pandas* as the processing engine for faster, on-the-fly
> >        feature generation in model training and API serving.
> >
> > This unified approach would eliminate redundant codebases, ensure
> > consistent feature definitions, and optimize performance across both
> > batch and real-time workflows.
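> >
> > As a rough sketch of the "flexible execution backend" idea (not an
> > existing Spark API): a codebase could select its engine at import
> > time via a hypothetical ENGINE environment variable, here using
> > DuckDB's experimental PySpark-compatible API as a stand-in for a
> > single-node backend:
> >
> > import os
> >
> > if os.environ.get("ENGINE") == "single-node":
> >     # Single-node engine exposing a PySpark-compatible surface.
> >     from duckdb.experimental.spark.sql import SparkSession
> >     from duckdb.experimental.spark.sql import functions as F
> > else:
> >     # Regular distributed Spark.
> >     from pyspark.sql import SparkSession
> >     from pyspark.sql import functions as F
> >
> > spark = SparkSession.builder.getOrCreate()
> > # Downstream transformations only touch the shared surface
> > # (spark, F, DataFrame), so they run unchanged on either engine.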
> >
> >
> > *The Proposal:*
> >
> > *Introduce an API that allows PySpark syntax while processing
> > DataFrames using either Spark or Pandas, depending on the session
> > context.*
> >
> >
> > *A simple but intuitive example:*
> >
> > from pyspark.sql import SparkSession
> > import pyspark.sql.functions as F
> >
> > def silver(bronze_df):
> >     return (
> >         bronze_df
> >         .withColumnRenamed("bronze_col", "silver_col")
> >     )
> >
> > def gold(silver_df):
> >     return (
> >         silver_df
> >         .withColumnRenamed("silver_col", "gold_col")
> >         .withColumn("gold_col", F.col("gold_col") + 1)
> >     )
> >
> > def features(gold_df):
> >     return (
> >         gold_df
> >         .withColumnRenamed("gold_col", "feature_col")
> >         .withColumn("feature_col", F.col("feature_col") + 1)
> >     )
> >
> > # With the Spark session (the normal way of using PySpark):
> > spark = SparkSession.builder.master("local[1]").getOrCreate()
> > bronze_df = spark.createDataFrame(data=[(1,)], schema=("bronze_col",))
> > silver_df = silver(bronze_df)
> > gold_df = gold(silver_df)
> > features_df = features(gold_df)
> > features_df.show()
> >
> > # Proposed "Pandas Spark Session":
> > spark = SparkSession.builder.as_pandas.getOrCreate()
> > # This would execute the same transformations using Pandas under
> > # the hood.
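> >
> > For the toy input above, both sessions should produce the same
> > result: bronze_col = 1 is renamed twice and incremented twice, so
> > features_df.show() prints
> >
> > +-----------+
> > |feature_col|
> > +-----------+
> > |          3|
> > +-----------+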
> >
> > This would enable teams to share the same codebase while choosing the
> > most efficient processing engine.
> >
> > We've built and experimented across different data teams with a
> > public library, *flypipe*
> > (https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate),
> > which uses pandas_api() transformations and can run them using Pandas
> > (https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas),
> > but it still requires ML teams to manage separate pipelines for Spark
> > dependencies.
> >
> > I’d love to hear thoughts from the community on this idea, and *if
> > there's a better approach to solving this issue*.
> >
> > Thanks,
> > José Müller
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
