Regardless, there are technical limitations here. For example, pandas operates in-memory on a single machine (such as the driver), making it unsuitable for large-scale datasets that exceed the memory or processing capacity of a single node. Unlike PySpark, pandas cannot distribute workloads across a cluster. Have you looked at Koalas, which I believe is currently integrated as pyspark.pandas?
HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Mon, 10 Feb 2025 at 12:04, José Müller <joseheliomul...@gmail.com> wrote:

> Hi all,
>
> I'm new to the Spark community—please let me know if this isn’t the right
> forum for feature proposals.
>
> *About Me:*
> I’ve spent over 10 years in data roles, from engineering to machine
> learning and analytics. A recurring challenge I've encountered is the
> disconnect between data engineering and ML teams when it comes to feature
> engineering and data transformations using Spark.
>
> *The Problem:*
>
> There is a growing disconnect between Data Engineering and Machine
> Learning teams due to differences in the tools and languages they use for
> data transformations.
>
> - *Tooling Divergence:*
>   - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark* for
>     building scalable data pipelines and transforming data across
>     different layers (bronze, silver, gold).
>   - *Machine Learning Teams* typically rely on *Pandas* for feature
>     engineering, model training, and recreating features on-the-fly in
>     production environments.
>
> - *Challenges Arising from This Divergence:*
>   1. *Code Duplication & Maintenance Overhead:*
>      - ML teams often need to rebuild feature pipelines from raw data or
>        adapt data engineer outputs into Pandas, leading to duplicated
>        effort and maintenance in two different codebases.
>      - This forces teams to maintain expertise in multiple frameworks,
>        increasing complexity and reducing cross-team supportability.
>   2. *Inconsistent Feature Definitions:*
>      - Transformations are implemented in different languages (PySpark
>        for DE, Pandas for ML), which causes *feature drift*—where the
>        same metrics or features calculated in separate environments
>        diverge over time, leading to inconsistent results.
>   3. *Performance Bottlenecks in Production:*
>      - While PySpark can regenerate features for serving models via APIs,
>        it introduces inefficiencies:
>        - *Overhead for Small Payloads:* Spark's distributed processing is
>          unnecessary for small, real-time requests and is slower compared
>          to Pandas.
>        - *Latency Issues:* Spark struggles to deliver millisecond-level
>          responses required in production APIs.
>        - *Operational Complexity:* Maintaining Spark within API
>          environments adds unnecessary overhead.
>   4. *Data Refinement Gap:*
>      - Since ML teams are often formed after data engineering teams, they
>        lag behind in leveraging years of PySpark-based data refinements.
>        Reproducing the same transformations in Pandas to match the DE
>        pipelines is time-consuming and error-prone.
>
> *The Ideal Scenario*
>
> To bridge the gap between Data Engineering and Machine Learning teams,
> the ideal solution would:
>
> 1. *Unify the Tech Stack:*
>    - Both Data Engineers and ML teams would use *PySpark syntax* for data
>      transformations and feature engineering. This shared language would
>      simplify code maintenance, improve collaboration, and reduce the
>      need for cross-training in multiple frameworks.
> 2. *Flexible Execution Backend:*
>    - While using the same PySpark code, teams could choose the most
>      efficient execution engine based on their needs:
>      - *Data Engineers* would continue leveraging Spark's distributed
>        processing for large-scale data transformations.
>      - *ML Teams* could run the same PySpark transformations using
>        *Pandas* as the processing engine for faster, on-the-fly feature
>        generation in model training and API serving.
>
> This unified approach would eliminate redundant codebases, ensure
> consistent feature definitions, and optimize performance across both
> batch and real-time workflows.
>
> *The Proposal:*
>
> *Introduce an API that allows PySpark syntax while processing DataFrames
> using either Spark or Pandas depending on the session context.*
>
> *Simple, but intuitive example:*
>
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> def silver(bronze_df):
>     return (
>         bronze_df
>         .withColumnRenamed("bronze_col", "silver_col")
>     )
>
> def gold(silver_df):
>     return (
>         silver_df
>         .withColumnRenamed("silver_col", "gold_col")
>         .withColumn("gold_col", F.col("gold_col") + 1)
>     )
>
> def features(gold_df):
>     return (
>         gold_df
>         .withColumnRenamed("gold_col", "feature_col")
>         .withColumn("feature_col", F.col("feature_col") + 1)
>     )
>
> # With the Spark Session (normal way of using PySpark)
> spark = SparkSession.builder.master("local[1]").getOrCreate()
> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
> silver_df = silver(bronze_df)
> gold_df = gold(silver_df)
> features_df = features(gold_df)
> features_df.show()
>
> # Proposed "Pandas Spark Session"
> spark = SparkSession.builder.as_pandas.getOrCreate()
> # This would execute the same transformations using Pandas under the hood.
>
> This would enable teams to share the same codebase while choosing the
> most efficient processing engine.
>
> We've built and experimented in different data teams with a public
> library, *flypipe
> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
> that uses pandas_api() transformations and can run using Pandas
> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
> but it still requires ML teams to manage separate pipelines for Spark
> dependencies.
>
> I’d love to hear thoughts from the community on this idea, and *if
> there's a better approach to solving this issue*.
>
> Thanks,
> José Müller
>
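To make the dual-backend idea in the quoted proposal concrete: a thin pandas facade exposing just the two PySpark DataFrame methods the example uses lets the same transformation functions run unchanged on plain pandas. A purely illustrative sketch — PandasColumn, PandasFrame, and the F stand-in below are hypothetical names, not any existing API:

```python
# Purely illustrative sketch of the proposed "Pandas backend": a minimal
# pandas facade implementing only withColumnRenamed / withColumn plus an
# F.col stand-in, so the silver/gold/features functions from the proposal
# run unmodified without a Spark runtime. All names here are hypothetical.
from types import SimpleNamespace

import pandas as pd


class PandasColumn:
    """Deferred column expression, mimicking pyspark.sql.Column."""
    def __init__(self, fn):
        self.fn = fn  # maps a pandas DataFrame to a Series

    def __add__(self, other):
        return PandasColumn(lambda df: self.fn(df) + other)


# Stand-in for pyspark.sql.functions (only `col` is needed here)
F = SimpleNamespace(col=lambda name: PandasColumn(lambda df: df[name]))


class PandasFrame:
    """Minimal PySpark-DataFrame-like wrapper over pandas."""
    def __init__(self, df):
        self.df = df

    def withColumnRenamed(self, old, new):
        return PandasFrame(self.df.rename(columns={old: new}))

    def withColumn(self, name, column):
        out = self.df.copy()
        out[name] = column.fn(out)  # evaluate the deferred expression
        return PandasFrame(out)


# The transformations from the proposal, untouched:
def silver(bronze_df):
    return bronze_df.withColumnRenamed("bronze_col", "silver_col")

def gold(silver_df):
    return (silver_df
            .withColumnRenamed("silver_col", "gold_col")
            .withColumn("gold_col", F.col("gold_col") + 1))

def features(gold_df):
    return (gold_df
            .withColumnRenamed("gold_col", "feature_col")
            .withColumn("feature_col", F.col("feature_col") + 1))

bronze_df = PandasFrame(pd.DataFrame({"bronze_col": [1]}))
features_df = features(gold(silver(bronze_df)))
print(features_df.df)  # feature_col == 3 (1 + 1 + 1)
```

A session-level switch as proposed would amount to dispatching between this kind of facade and the real pyspark.sql.DataFrame; the hard part, as the thread notes, is covering the full PySpark surface consistently rather than the two methods sketched here.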