I would suggest an approach along the lines of the Lambda Architecture, which
is flexible enough for this. The attached diagram is dated but illustrates the
idea. The architecture would consist of two main components plus a mapping
layer between them:

1) Batch Feature Engineering with Spark: Purpose: Process large datasets to
generate features for model training

   - Use Spark for large-scale feature engineering during model training.
   - Store the feature engineering logic in a reusable format (e.g., a
   transformation pipeline).

2) Real-Time Feature Generation with Pandas: Purpose: Generate features
on-the-fly for low-latency model serving

   - Use Pandas for low-latency, on-the-fly feature generation during model
   serving.
   - Translate Spark transformations into Pandas transformations for
   consistency.

3) Mapping Spark Transformations to Pandas: Purpose: Ensure consistency
between batch and real-time feature engineering

   - Extract the transformation logic from the Spark pipeline.
   - Implement equivalent transformations in Pandas.
   - Use a shared library or configuration file to maintain consistency
   between Spark and Pandas transformations (see the sketch below).
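
A minimal sketch of that shared-logic idea, assuming a simple "rename plus
derived column" style of transformation (the names FEATURE_SPEC, apply_spark
and apply_pandas are illustrative, not an existing API):

import pandas as pd
import pyspark.sql.functions as F

# Single source of truth for the transformation, shared by both engines
FEATURE_SPEC = {
    "renames": {"bronze_col": "feature_col"},
    "derived": {"feature_plus_one": ("feature_col", 1)},  # new_col: (source_col, offset)
}

def apply_spark(df, spec=FEATURE_SPEC):
    # Batch path: apply the spec to a PySpark DataFrame for model training
    for old, new in spec["renames"].items():
        df = df.withColumnRenamed(old, new)
    for new, (src, offset) in spec["derived"].items():
        df = df.withColumn(new, F.col(src) + offset)
    return df

def apply_pandas(pdf, spec=FEATURE_SPEC):
    # Serving path: apply the same spec to a pandas DataFrame at request time
    pdf = pdf.rename(columns=spec["renames"])
    for new, (src, offset) in spec["derived"].items():
        pdf[new] = pdf[src] + offset
    return pdf

# e.g. apply_pandas(pd.DataFrame({"bronze_col": [1]}))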


[image: LambdaArchitecture_MD_BigQuery.png]

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Mon, 10 Feb 2025 at 12:39, José Müller <joseheliomul...@gmail.com> wrote:

> Hi Mich,
>
> All you said is well understood, but I believe you are missing the point:
> the proposal is not to change Spark's way of processing, but to use Spark as
> a wrapper that runs Pandas, the same as `pandas_api()` but in the inverse
> direction.
>
> Most cases of serving ML models require low latency (milliseconds), and the
> ideal is to re-generate the features on the fly when receiving the payload,
> for example via a POST request.
>
> In these cases the payloads are small and do not require the whole Spark
> infrastructure; it is better to process them with local Pandas instead of
> Spark. So the proposal is for Spark to "map" the PySpark DataFrame
> transformations and invoke equivalent Pandas transformations.
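>
> To make the direction concrete, here is a rough sketch (mine, illustrative
> only): PySpark today already offers the opposite mapping, i.e. pandas syntax
> executed by the Spark engine:
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[1]").getOrCreate()
> spark_df = spark.createDataFrame([(1,)], ["bronze_col"])
>
> # Existing direction: pandas syntax, Spark engine (pandas-on-Spark)
> psdf = spark_df.pandas_api()
> psdf["silver_col"] = psdf["bronze_col"] + 1
>
> # Proposed inverse: PySpark DataFrame syntax executed by a local pandas
> # engine for small, low-latency payloads (hypothetical):
> # spark = SparkSession.builder.as_pandas.getOrCreate()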
>
> I believe the point you are missing is that when building features for model
> training we have large datasets, so Spark is recommended; but Spark is not
> recommended for serving real-time, low-latency requests.
>
> I know Koalas and the pandas API, as described at the end of the email.
> However, using Koalas or Pandas does not solve the problem of data teams
> having to support two code syntaxes.
>
> Müller
>
> On Mon, 10 Feb 2025 at 09:27, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Regardless, there are technical limitations here. For example, Pandas
>> operates in-memory on a single machine (such as the driver), making it
>> unsuitable for large-scale datasets that exceed the memory or processing
>> capacity of a single node. Unlike PySpark, Pandas cannot distribute
>> workloads across a cluster. Have you looked at Koalas, which I believe is
>> currently integrated as pyspark.pandas?
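>>
>> As a quick illustration (paths are placeholders), the pandas API on Spark
>> keeps pandas syntax while Spark does the distributed work:
>>
>> import pyspark.pandas as ps
>>
>> psdf = ps.read_parquet("/data/bronze/")           # placeholder path
>> psdf["feature_col"] = psdf["raw_col"] + 1         # pandas syntax, Spark engine
>> psdf.to_spark().write.mode("overwrite").parquet("/data/silver/")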
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Mon, 10 Feb 2025 at 12:04, José Müller <joseheliomul...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm new to the Spark community—please let me know if this isn’t the
>>> right forum for feature proposals.
>>>
>>> *About Me:*
>>> I’ve spent over 10 years in data roles, from engineering to machine
>>> learning and analytics. A recurring challenge I've encountered is the
>>> disconnect between data engineering and ML teams when it comes to feature
>>> engineering and data transformations using Spark.
>>>
>>> *The Problem:*
>>>
>>> There is a growing disconnect between Data Engineering and Machine
>>> Learning teams due to differences in the tools and languages they use for
>>> data transformations.
>>>
>>>    - *Tooling Divergence:*
>>>      - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark*
>>>        for building scalable data pipelines and transforming data across
>>>        different layers (bronze, silver, gold).
>>>      - *Machine Learning Teams* typically rely on *Pandas* for feature
>>>        engineering, model training, and recreating features on-the-fly in
>>>        production environments.
>>>
>>>    - *Challenges Arising from This Divergence:*
>>>      1. *Code Duplication & Maintenance Overhead:*
>>>         - ML teams often need to rebuild feature pipelines from raw data
>>>           or adapt data engineering outputs into Pandas, leading to
>>>           duplicated effort and maintenance in two different codebases.
>>>         - This forces teams to maintain expertise in multiple frameworks,
>>>           increasing complexity and reducing cross-team supportability.
>>>      2. *Inconsistent Feature Definitions:*
>>>         - Transformations are implemented in different languages (PySpark
>>>           for DE, Pandas for ML), which causes *feature drift*, where the
>>>           same metrics or features calculated in separate environments
>>>           diverge over time, leading to inconsistent results.
>>>      3. *Performance Bottlenecks in Production:*
>>>         - While PySpark can regenerate features for serving models via
>>>           APIs, it introduces inefficiencies:
>>>           - *Overhead for Small Payloads:* Spark's distributed processing
>>>             is unnecessary for small, real-time requests and is slower
>>>             compared to Pandas.
>>>           - *Latency Issues:* Spark struggles to deliver millisecond-level
>>>             responses required in production APIs.
>>>           - *Operational Complexity:* Maintaining Spark within API
>>>             environments adds unnecessary overhead.
>>>      4. *Data Refinement Gap:*
>>>         - Since ML teams are often formed after data engineering teams,
>>>           they lag behind in leveraging years of PySpark-based data
>>>           refinements. Reproducing the same transformations in Pandas to
>>>           match the DE pipelines is time-consuming and error-prone.
>>>
>>>
>>> *The Ideal Scenario*
>>>
>>> To bridge the gap between Data Engineering and Machine Learning teams,
>>> the ideal solution would:
>>>
>>>    1. *Unify the Tech Stack:*
>>>       - Both Data Engineers and ML teams would use *PySpark syntax* for
>>>         data transformations and feature engineering. This shared language
>>>         would simplify code maintenance, improve collaboration, and reduce
>>>         the need for cross-training in multiple frameworks.
>>>    2. *Flexible Execution Backend:*
>>>       - While using the same PySpark code, teams could choose the most
>>>         efficient execution engine based on their needs:
>>>         - *Data Engineers* would continue leveraging Spark's distributed
>>>           processing for large-scale data transformations.
>>>         - *ML Teams* could run the same PySpark transformations using
>>>           *Pandas* as the processing engine for faster, on-the-fly feature
>>>           generation in model training and API serving.
>>>
>>> This unified approach would eliminate redundant codebases, ensure
>>> consistent feature definitions, and optimize performance across both batch
>>> and real-time workflows.
>>>
>>>
>>> *The Proposal:*
>>>
>>> *Introduce an API that allows PySpark syntax while processing DataFrames
>>> using either Spark or Pandas, depending on the session context.*
>>>
>>>
>>> *Simple, but intuitive example:*
>>>
>>> from pyspark.sql import SparkSession
>>> import pyspark.sql.functions as F
>>>
>>> def silver(bronze_df):
>>>     return (
>>>         bronze_df
>>>         .withColumnRenamed("bronze_col", "silver_col")
>>>     )
>>>
>>> def gold(silver_df):
>>>     return (
>>>         silver_df
>>>         .withColumnRenamed("silver_col", "gold_col")
>>>         .withColumn("gold_col", F.col("gold_col") + 1)
>>>     )
>>>
>>> def features(gold_df):
>>>     return (
>>>         gold_df
>>>         .withColumnRenamed("gold_col", "feature_col")
>>>         .withColumn("feature_col", F.col("feature_col") + 1)
>>>     )
>>>
>>> # With the Spark Session (normal way of using PySpark)
>>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
>>> silver_df = silver(bronze_df)
>>> gold_df = gold(silver_df)
>>> features_df = features(gold_df)
>>> features_df.show()
>>>
>>> # Proposed "Pandas Spark Session"
>>> spark = SparkSession.builder.as_pandas.getOrCreate()
>>> # This would execute the same transformations using Pandas under the hood.
>>>
>>> This would enable teams to share the same codebase while choosing the
>>> most efficient processing engine.
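>>>
>>> One possible shape for the Pandas execution path, purely to illustrate
>>> feasibility (a rough sketch, not a concrete design): a thin wrapper that
>>> exposes the DataFrame methods used above on top of pandas, plus a stand-in
>>> for F.col. A real implementation would need to translate actual PySpark
>>> Column expressions.
>>>
>>> import pandas as pd
>>>
>>> class _Col:
>>>     # Tiny stand-in for pyspark.sql.Column: a column name plus an
>>>     # additive constant, which is all the example above needs
>>>     def __init__(self, name, offset=0):
>>>         self.name, self.offset = name, offset
>>>     def __add__(self, value):
>>>         return _Col(self.name, self.offset + value)
>>>
>>> class PandasDataFrame:
>>>     # Wraps a pandas DataFrame and exposes the PySpark methods used above
>>>     def __init__(self, pdf):
>>>         self._pdf = pdf
>>>     def withColumnRenamed(self, old, new):
>>>         return PandasDataFrame(self._pdf.rename(columns={old: new}))
>>>     def withColumn(self, name, col):
>>>         out = self._pdf.copy()
>>>         out[name] = out[col.name] + col.offset
>>>         return PandasDataFrame(out)
>>>     def show(self):
>>>         print(self._pdf)
>>>
>>> # With F.col swapped for the stand-in, silver/gold/features above would
>>> # run unchanged on pandas, e.g.:
>>> # features(gold(silver(PandasDataFrame(pd.DataFrame({"bronze_col": [1]}))))).show()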
>>>
>>> We've built and experimented with, across different data teams, a public
>>> library, *flypipe
>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
>>> that uses pandas_api() transformations but can also run using Pandas
>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>;
>>> however, it still requires ML teams to manage separate pipelines for Spark
>>> dependencies.
>>>
>>> I’d love to hear thoughts from the community on this idea, and *if
>>> there's a better approach to solving this issue*.
>>>
>>> Thanks,
>>> José Müller
>>>
>>
>
> --
> José Müller
>
