I would suggest something like a Lambda Architecture, which is flexible enough for this. The attached diagram, though dated, illustrates the idea. The architecture would consist of two main components plus a mapping layer:
1) Batch feature engineering with Spark
   Purpose: process large datasets to generate features for model training.
   - Use Spark for large-scale feature engineering during model training.
   - Store the feature engineering logic in a reusable format (e.g., a transformation pipeline).

2) Real-time feature generation with Pandas
   Purpose: generate features on the fly for low-latency model serving.
   - Use Pandas for low-latency, on-the-fly feature generation during model serving.
   - Translate Spark transformations into Pandas transformations for consistency.

3) Mapping Spark transformations to Pandas
   Purpose: ensure consistency between batch and real-time feature engineering.
   - Extract the transformation logic from the Spark pipeline.
   - Implement equivalent transformations in Pandas.
   - Use a shared library or configuration file to keep the Spark and Pandas transformations consistent (a minimal sketch of this idea appears further down, after the quoted introduction).

[image: LambdaArchitecture_MD_BigQuery.png]

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Mon, 10 Feb 2025 at 12:39, José Müller <joseheliomul...@gmail.com> wrote:

> Hi Mich,
>
> All you said is well understood, but I believe you are missing the point:
> the proposal is not to break Spark's way of processing, but to use Spark
> as a wrapper that processes with Pandas, the same as `pandas_api()` but
> in reverse.
>
> Most use cases for serving ML models require low latency (milliseconds),
> and ideally the features are regenerated on the fly when the payload is
> received, for example via a POST request.
>
> In these cases the payloads are small and do not require the whole Spark
> infrastructure, and it is better to process them with local Pandas instead
> of Spark. So the proposal is for Spark to "map" the PySpark DataFrame
> transformations and invoke the equivalent Pandas transformations.
>
> I believe the point you are missing is that when building features for
> model training we have large datasets, and Spark is therefore recommended;
> but Spark is not recommended for serving real-time, low-latency requests.
>
> I know Koalas and the pandas API, as described at the end of the email.
> However, using Koalas or Pandas does not solve the problem of data teams
> having to support two code syntaxes.
>
> Müller
>
> On Mon, 10 Feb 2025 at 09:27, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Regardless, there are technical limitations here. For example, Pandas
>> operates in memory on a single machine (such as the driver), making it
>> unsuitable for large-scale datasets that exceed the memory or processing
>> capacity of a single node. Unlike PySpark, Pandas cannot distribute
>> workloads across a cluster. Have you looked at Koalas, which I believe is
>> currently integrated as pyspark.pandas?
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> On Mon, 10 Feb 2025 at 12:04, José Müller <joseheliomul...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm new to the Spark community—please let me know if this isn’t the
>>> right forum for feature proposals.
>>>
>>> *About Me:*
>>> I’ve spent over 10 years in data roles, from engineering to machine
>>> learning and analytics. A recurring challenge I've encountered is the
>>> disconnect between data engineering and ML teams when it comes to feature
>>> engineering and data transformations using Spark.
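(To make point 3 of the architecture above concrete: a minimal, hypothetical sketch of a shared transformation specification, interpreted once by Spark for batch training and once by Pandas for serving. The names FEATURE_SPEC, apply_with_spark and apply_with_pandas are illustrative only and are not part of either proposal.)

import pandas as pd
import pyspark.sql.functions as F

# Shared "configuration": each step is (operation, arguments), defined once.
FEATURE_SPEC = [
    ("rename", {"old": "bronze_col", "new": "feature_col"}),
    ("add_constant", {"col": "feature_col", "value": 1}),
]

def apply_with_spark(df, spec):
    # Batch path: interpret the shared spec against a PySpark DataFrame.
    for op, args in spec:
        if op == "rename":
            df = df.withColumnRenamed(args["old"], args["new"])
        elif op == "add_constant":
            df = df.withColumn(args["col"], F.col(args["col"]) + args["value"])
    return df

def apply_with_pandas(df: pd.DataFrame, spec) -> pd.DataFrame:
    # Serving path: interpret the same spec against a local Pandas DataFrame.
    for op, args in spec:
        if op == "rename":
            df = df.rename(columns={args["old"]: args["new"]})
        elif op == "add_constant":
            df = df.assign(**{args["col"]: df[args["col"]] + args["value"]})
    return df

# Training: features_df = apply_with_spark(spark_bronze_df, FEATURE_SPEC)
# Serving:  features_pdf = apply_with_pandas(pd.DataFrame([{"bronze_col": 1}]), FEATURE_SPEC)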
>>>
>>> *The Problem:*
>>>
>>> There is a growing disconnect between Data Engineering and Machine
>>> Learning teams due to differences in the tools and languages they use for
>>> data transformations.
>>>
>>> *Tooling Divergence:*
>>> - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark* for
>>>   building scalable data pipelines and transforming data across different
>>>   layers (bronze, silver, gold).
>>> - *Machine Learning Teams* typically rely on *Pandas* for feature
>>>   engineering, model training, and recreating features on-the-fly in
>>>   production environments.
>>>
>>> *Challenges Arising from This Divergence:*
>>>
>>> 1. *Code Duplication & Maintenance Overhead:*
>>>    - ML teams often need to rebuild feature pipelines from raw data or
>>>      adapt data engineering outputs into Pandas, leading to duplicated
>>>      effort and maintenance in two different codebases (a short
>>>      illustration of this duplication appears below, after the ideal
>>>      scenario).
>>>    - This forces teams to maintain expertise in multiple frameworks,
>>>      increasing complexity and reducing cross-team supportability.
>>>
>>> 2. *Inconsistent Feature Definitions:*
>>>    - Transformations are implemented in different languages (PySpark for
>>>      DE, Pandas for ML), which causes *feature drift*—where the same
>>>      metrics or features calculated in separate environments diverge over
>>>      time, leading to inconsistent results.
>>>
>>> 3. *Performance Bottlenecks in Production:*
>>>    - While PySpark can regenerate features for serving models via APIs,
>>>      it introduces inefficiencies:
>>>      - *Overhead for Small Payloads:* Spark's distributed processing is
>>>        unnecessary for small, real-time requests and is slower compared
>>>        to Pandas.
>>>      - *Latency Issues:* Spark struggles to deliver the millisecond-level
>>>        responses required in production APIs.
>>>      - *Operational Complexity:* Maintaining Spark within API
>>>        environments adds unnecessary overhead.
>>>
>>> 4. *Data Refinement Gap:*
>>>    - Since ML teams are often formed after data engineering teams, they
>>>      lag behind in leveraging years of PySpark-based data refinements.
>>>      Reproducing the same transformations in Pandas to match the DE
>>>      pipelines is time-consuming and error-prone.
>>>
>>>
>>> *The Ideal Scenario*
>>>
>>> To bridge the gap between Data Engineering and Machine Learning teams,
>>> the ideal solution would:
>>>
>>> 1. *Unify the Tech Stack:*
>>>    - Both Data Engineers and ML teams would use *PySpark syntax* for data
>>>      transformations and feature engineering. This shared language would
>>>      simplify code maintenance, improve collaboration, and reduce the
>>>      need for cross-training in multiple frameworks.
>>>
>>> 2. *Flexible Execution Backend:*
>>>    - While using the same PySpark code, teams could choose the most
>>>      efficient execution engine based on their needs:
>>>      - *Data Engineers* would continue leveraging Spark's distributed
>>>        processing for large-scale data transformations.
>>>      - *ML Teams* could run the same PySpark transformations using
>>>        *Pandas* as the processing engine for faster, on-the-fly feature
>>>        generation in model training and API serving.
>>>
>>> This unified approach would eliminate redundant codebases, ensure
>>> consistent feature definitions, and optimize performance across both batch
>>> and real-time workflows.
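(As referenced in challenge 1 above: a small, hypothetical illustration of the duplication and drift problem. The column and function names are invented for this example; the same feature ends up implemented twice, once per engine, and the two copies must be kept in sync by hand.)

import pandas as pd
import pyspark.sql.functions as F

def amount_with_fee_spark(df):
    # Data engineering pipeline (batch, PySpark).
    return df.withColumn("amount_with_fee", F.col("amount") * 1.05)

def amount_with_fee_pandas(df: pd.DataFrame) -> pd.DataFrame:
    # ML serving path (real-time, Pandas), re-implemented by hand;
    # if the fee changes in one copy but not the other, the feature drifts.
    return df.assign(amount_with_fee=df["amount"] * 1.05)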
>>>
>>>
>>> *The Proposal:*
>>>
>>> *Introduce an API that allows PySpark syntax while processing the
>>> DataFrame using either Spark or Pandas, depending on the session context.*
>>>
>>> *A simple but intuitive example:*
>>>
>>> from pyspark.sql import SparkSession
>>> import pyspark.sql.functions as F
>>>
>>> def silver(bronze_df):
>>>     return (
>>>         bronze_df
>>>         .withColumnRenamed("bronze_col", "silver_col")
>>>     )
>>>
>>> def gold(silver_df):
>>>     return (
>>>         silver_df
>>>         .withColumnRenamed("silver_col", "gold_col")
>>>         .withColumn("gold_col", F.col("gold_col") + 1)
>>>     )
>>>
>>> def features(gold_df):
>>>     return (
>>>         gold_df
>>>         .withColumnRenamed("gold_col", "feature_col")
>>>         .withColumn("feature_col", F.col("feature_col") + 1)
>>>     )
>>>
>>> # With the Spark session (the normal way of using PySpark)
>>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
>>> silver_df = silver(bronze_df)
>>> gold_df = gold(silver_df)
>>> features_df = features(gold_df)
>>> features_df.show()
>>>
>>> # Proposed "Pandas Spark session"
>>> spark = SparkSession.builder.as_pandas.getOrCreate()
>>> # This would execute the same transformations using Pandas under the hood.
>>>
>>> This would enable teams to share the same codebase while choosing the
>>> most efficient processing engine.
>>>
>>> We've built and experimented with a public library across different data
>>> teams, *flypipe
>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
>>> that uses pandas_api() transformations and can also run them using Pandas
>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
>>> but it still requires ML teams to manage separate pipelines for the Spark
>>> dependencies.
>>>
>>> I’d love to hear thoughts from the community on this idea, and *if
>>> there's a better approach to solving this issue*.
>>>
>>> Thanks,
>>> José Müller
>>>
>>
>
> --
> José Müller
>
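(For reference on the pandas_api() / Koalas direction mentioned in the thread: a minimal sketch, assuming PySpark 3.2 or later, of the bridge that already exists today, the pandas API on Spark (pyspark.pandas). The proposal above is effectively the inverse of this; no `as_pandas` session exists in current PySpark.)

import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark_df = spark.createDataFrame(data=[(1,)], schema=("bronze_col",))

# PySpark DataFrame -> pandas-on-Spark DataFrame (pandas syntax, Spark engine)
ps_df = spark_df.pandas_api()
ps_df = ps_df.rename(columns={"bronze_col": "feature_col"})
ps_df["feature_col"] = ps_df["feature_col"] + 1

# Collect to a plain, in-memory pandas DataFrame for local, low-latency use
local_pdf = ps_df.to_pandas()
print(local_pdf)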