Regardless, there are technical limitations here. For example, pandas operates in-memory on a single machine (such as the driver), making it unsuitable for large-scale datasets that exceed the memory or processing capacity of a single node. Unlike PySpark, pandas cannot distribute workloads across a cluster. Have you looked at Koalas, which I believe is currently integrated as pyspark.pandas?
HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Mon, 10 Feb 2025 at 12:04, José Müller <joseheliomul...@gmail.com> wrote:

> Hi all,
>
> I'm new to the Spark community—please let me know if this isn’t the right
> forum for feature proposals.
>
> *About Me:*
> I’ve spent over 10 years in data roles, from engineering to machine
> learning and analytics. A recurring challenge I've encountered is the
> disconnect between data engineering and ML teams when it comes to feature
> engineering and data transformations using Spark.
>
> *The Problem:*
>
> There is a growing disconnect between Data Engineering and Machine
> Learning teams due to differences in the tools and languages they use for
> data transformations.
>
> - *Tooling Divergence:*
>   - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark* for
>     building scalable data pipelines and transforming data across
>     different layers (bronze, silver, gold).
>   - *Machine Learning Teams* typically rely on *Pandas* for feature
>     engineering, model training, and recreating features on-the-fly in
>     production environments.
>
> - *Challenges Arising from This Divergence:*
>   1. *Code Duplication & Maintenance Overhead:*
>      - ML teams often need to rebuild feature pipelines from raw data or
>        adapt data engineer outputs into Pandas, leading to duplicated
>        effort and maintenance in two different codebases.
>      - This forces teams to maintain expertise in multiple frameworks,
>        increasing complexity and reducing cross-team supportability.
>   2. *Inconsistent Feature Definitions:*
>      - Transformations are implemented in different languages (PySpark
>        for DE, Pandas for ML), which causes *feature drift*—where the
>        same metrics or features calculated in separate environments
>        diverge over time, leading to inconsistent results.
>   3. *Performance Bottlenecks in Production:*
>      - While PySpark can regenerate features for serving models via APIs,
>        it introduces inefficiencies:
>        - *Overhead for Small Payloads:* Spark's distributed processing is
>          unnecessary for small, real-time requests and is slower compared
>          to Pandas.
>        - *Latency Issues:* Spark struggles to deliver millisecond-level
>          responses required in production APIs.
>        - *Operational Complexity:* Maintaining Spark within API
>          environments adds unnecessary overhead.
>   4. *Data Refinement Gap:*
>      - Since ML teams are often formed after data engineering teams, they
>        lag behind in leveraging years of PySpark-based data refinements.
>        Reproducing the same transformations in Pandas to match the DE
>        pipelines is time-consuming and error-prone.
>
> *The Ideal Scenario*
>
> To bridge the gap between Data Engineering and Machine Learning teams,
> the ideal solution would:
>
> 1. *Unify the Tech Stack:*
>    - Both Data Engineers and ML teams would use *PySpark syntax* for data
>      transformations and feature engineering. This shared language would
>      simplify code maintenance, improve collaboration, and reduce the
>      need for cross-training in multiple frameworks.
> 2. *Flexible Execution Backend:*
>    - While using the same PySpark code, teams could choose the most
>      efficient execution engine based on their needs:
>      - *Data Engineers* would continue leveraging Spark's distributed
>        processing for large-scale data transformations.
>      - *ML Teams* could run the same PySpark transformations using
>        *Pandas* as the processing engine for faster, on-the-fly feature
>        generation in model training and API serving.
>
> This unified approach would eliminate redundant codebases, ensure
> consistent feature definitions, and optimize performance across both
> batch and real-time workflows.
>
> *The Proposal:*
>
> *Introduce an API that allows PySpark syntax while processing DataFrames
> using either Spark or Pandas depending on the session context.*
>
> *Simple, but intuitive example:*
>
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> def silver(bronze_df):
>     return (
>         bronze_df
>         .withColumnRenamed("bronze_col", "silver_col")
>     )
>
> def gold(silver_df):
>     return (
>         silver_df
>         .withColumnRenamed("silver_col", "gold_col")
>         .withColumn("gold_col", F.col("gold_col") + 1)
>     )
>
> def features(gold_df):
>     return (
>         gold_df
>         .withColumnRenamed("gold_col", "feature_col")
>         .withColumn("feature_col", F.col("feature_col") + 1)
>     )
>
> # With the Spark Session (normal way of using PySpark)
> spark = SparkSession.builder.master("local[1]").getOrCreate()
> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
> silver_df = silver(bronze_df)
> gold_df = gold(silver_df)
> features_df = features(gold_df)
> features_df.show()
>
> # Proposed "Pandas Spark Session"
> spark = SparkSession.builder.as_pandas.getOrCreate()
> # This would execute the same transformations using Pandas under the hood.
>
> This would enable teams to share the same codebase while choosing the
> most efficient processing engine.
>
> We've built and experimented in different data teams with a public
> library, *flypipe
> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
> that uses pandas_api() transformations and can run using Pandas
> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
> but it still requires ML teams to manage separate pipelines for Spark
> dependencies.
>
> I’d love to hear thoughts from the community on this idea, and *if
> there's a better approach to solving this issue*.
>
> Thanks,
> José Müller
>
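To make the dual-backend idea in the quoted proposal concrete: a thin pandas facade exposing just the two PySpark DataFrame methods the example uses lets the same transformation functions run unchanged on plain pandas. A purely illustrative sketch — PandasColumn, PandasFrame, and the F stand-in below are hypothetical names, not any existing API:

```python
# Purely illustrative sketch of the proposed "Pandas backend": a minimal
# pandas facade implementing only withColumnRenamed / withColumn plus an
# F.col stand-in, so the silver/gold/features functions from the proposal
# run unmodified without a Spark runtime. All names here are hypothetical.
from types import SimpleNamespace

import pandas as pd


class PandasColumn:
    """Deferred column expression, mimicking pyspark.sql.Column."""
    def __init__(self, fn):
        self.fn = fn  # maps a pandas DataFrame to a Series

    def __add__(self, other):
        return PandasColumn(lambda df: self.fn(df) + other)


# Stand-in for pyspark.sql.functions (only `col` is needed here)
F = SimpleNamespace(col=lambda name: PandasColumn(lambda df: df[name]))


class PandasFrame:
    """Minimal PySpark-DataFrame-like wrapper over pandas."""
    def __init__(self, df):
        self.df = df

    def withColumnRenamed(self, old, new):
        return PandasFrame(self.df.rename(columns={old: new}))

    def withColumn(self, name, column):
        out = self.df.copy()
        out[name] = column.fn(out)  # evaluate the deferred expression
        return PandasFrame(out)


# The transformations from the proposal, untouched:
def silver(bronze_df):
    return bronze_df.withColumnRenamed("bronze_col", "silver_col")

def gold(silver_df):
    return (silver_df
            .withColumnRenamed("silver_col", "gold_col")
            .withColumn("gold_col", F.col("gold_col") + 1))

def features(gold_df):
    return (gold_df
            .withColumnRenamed("gold_col", "feature_col")
            .withColumn("feature_col", F.col("feature_col") + 1))

bronze_df = PandasFrame(pd.DataFrame({"bronze_col": [1]}))
features_df = features(gold(silver(bronze_df)))
print(features_df.df)  # feature_col == 3 (1 + 1 + 1)
```

A session-level switch as proposed would amount to dispatching between this kind of facade and the real pyspark.sql.DataFrame; the hard part, as the thread notes, is covering the full PySpark surface consistently rather than the two methods sketched here.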