Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

José Müller Mon, 10 Feb 2025 06:35:17 -0800

Hi Wenchen,

"Interesting, so this is PySpark on pandas which is the reverse of Koalas."
yes, exactly


"If performance is the only problem, maybe we can improve local-mode Spark
performance to be on par with these single-node engines. + @Hyukjin Kwon
<gurwls...@gmail.com> "
Not sure if we could gain much performance, even in single nodes. I did
some tests and tried to run spark as lean as possible, but the could not
get very simple transformation like column rename or sum to run under a
second, not an expert either, may there is a single-node setup that could
achieve performance as fast as pandas?

We have to consider the time to create a spark session in API that takes a
considerable amount of time to create it. Users could try to minimise that
by creating some kind of SparkSessioFactory, but it will require more from
the users and they might not have much flexibility when hosting their
models.

Thanks for the reply.

Müller

On Mon, 10 Feb 2025 at 11:09, Wenchen Fan <cloud0...@gmail.com> wrote:

> Interesting, so this is PySpark on pandas which is the reverse of Koalas.
>
> If performance is the only problem, maybe we can improve local-mode Spark
> performance to be on par with these single-node engines. + @Hyukjin Kwon
> <gurwls...@gmail.com>
>
> On Mon, Feb 10, 2025 at 8:40 PM José Müller <joseheliomul...@gmail.com>
> wrote:
>
>> Hi Mitch,
>>
>> All you said is well understood, but I believe you are missing the point,
>> the proposal is not to break Spark ways of processing, but to use spark as
>> a wrapper to process pandas, same as `pandas_api()`, but the inverse.
>>
>> Most of the cases to serve ML models require low latency (ms) and ideal
>> is to re-generate the features on the fly when receiving the payload via a
>> POST request, for example.
>>
>> In these cases the payloads are small and do not require the whole Spark
>> infrastructure, and it is better to process using local pandas instead of
>> Spark, so the proposal is Spark to "map" the
>> Pyspark Dataframes transformations and invoke Pandas transformations.
>>
>> I believe the point you are missing is when building the features for
>> model training, we have large datasets and therefore Spark is recommended,
>> but spark is not recommended to serve realtime and low latency requests.
>>
>> I know Koalas and pandas api, as described in the email, at the end.
>> However using koalas or pandas does not solve the problem of data teams
>> having to support 2 code syntaxes.
>>
>> Müller
>>
>> On Mon, 10 Feb 2025 at 09:27, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Regardless there are technical limitations here. For example pandas
>>> operates in-memory on a single machine like the driver, making it
>>> unsuitable for large-scale datasets that exceed the memory or processing
>>> capacity of a single node. Unlike pySpark, pandas cannot distribute
>>> workloads across a cluster, Have you looked at Koalas which I believe
>>> is currently integrated as pyspark.pandas?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 10 Feb 2025 at 12:04, José Müller <joseheliomul...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm new to the Spark community—please let me know if this isn’t the
>>>> right forum for feature proposals.
>>>>
>>>> *About Me:*
>>>> I’ve spent over 10 years in data roles, from engineering to machine
>>>> learning and analytics. A recurring challenge I've encountered is the
>>>> disconnect between data engineering and ML teams when it comes to feature
>>>> engineering and data transformations using Spark.
>>>>
>>>> *The Problem:*
>>>>
>>>> There is a growing disconnect between Data Engineering and Machine
>>>> Learning teams due to differences in the tools and languages they use for
>>>> data transformations.
>>>>
>>>>    -
>>>>
>>>>    *Tooling Divergence:*
>>>>    - *Data Engineering Teams* primarily use *Spark SQL* and *PySpark*
>>>>       for building scalable data pipelines and transforming data across 
>>>> different
>>>>       layers (bronze, silver, gold).
>>>>       - *Machine Learning Teams* typically rely on *Pandas* for
>>>>       feature engineering, model training, and recreating features 
>>>> on-the-fly in
>>>>       production environments.
>>>>    -
>>>>
>>>>    *Challenges Arising from This Divergence:*
>>>>    1.
>>>>
>>>>       *Code Duplication & Maintenance Overhead:*
>>>>       - ML teams often need to rebuild feature pipelines from raw data
>>>>          or adapt data engineer outputs into Pandas, leading to duplicated 
>>>> effort
>>>>          and maintenance in two different codebases.
>>>>          - This forces teams to maintain expertise in multiple
>>>>          frameworks, increasing complexity and reducing cross-team 
>>>> supportability.
>>>>       2.
>>>>
>>>>       *Inconsistent Feature Definitions:*
>>>>       - Transformations are implemented in different languages
>>>>          (PySpark for DE, Pandas for ML), which causes *feature 
>>>> drift*—where
>>>>          the same metrics or features calculated in separate environments 
>>>> diverge
>>>>          over time, leading to inconsistent results.
>>>>       3.
>>>>
>>>>       *Performance Bottlenecks in Production:*
>>>>       - While PySpark can regenerate features for serving models via
>>>>          APIs, it introduces inefficiencies:
>>>>             - *Overhead for Small Payloads:* Spark's distributed
>>>>             processing is unnecessary for small, real-time requests and is 
>>>> slower
>>>>             compared to Pandas.
>>>>             - *Latency Issues:* Spark struggles to deliver
>>>>             millisecond-level responses required in production APIs.
>>>>             - *Operational Complexity:* Maintaining Spark within API
>>>>             environments adds unnecessary overhead.
>>>>          4.
>>>>
>>>>       *Data Refinement Gap:*
>>>>       - Since ML teams are often formed after data engineering teams,
>>>>          they lag behind in leveraging years of PySpark-based data 
>>>> refinements.
>>>>          Reproducing the same transformations in Pandas to match the DE 
>>>> pipelines is
>>>>          time-consuming and error-prone.
>>>>
>>>>
>>>> *The Ideal Scenario*
>>>>
>>>> To bridge the gap between Data Engineering and Machine Learning teams,
>>>> the ideal solution would:
>>>>
>>>>    1.
>>>>
>>>>    *Unify the Tech Stack:*
>>>>    - Both Data Engineers and ML teams would use *PySpark syntax* for
>>>>       data transformations and feature engineering. This shared language 
>>>> would
>>>>       simplify code maintenance, improve collaboration, and reduce the 
>>>> need for
>>>>       cross-training in multiple frameworks.
>>>>    2.
>>>>
>>>>    *Flexible Execution Backend:*
>>>>    - While using the same PySpark code, teams could choose the most
>>>>       efficient execution engine based on their needs:
>>>>          - *Data Engineers* would continue leveraging Spark's
>>>>          distributed processing for large-scale data transformations.
>>>>          - *ML Teams* could run the same PySpark transformations using
>>>>          *Pandas* as the processing engine for faster, on-the-fly
>>>>          feature generation in model training and API serving.
>>>>
>>>> This unified approach would eliminate redundant codebases, ensure
>>>> consistent feature definitions, and optimize performance across both batch
>>>> and real-time workflows.
>>>>
>>>>
>>>> *The Proposal:*
>>>>
>>>> *Introduce an API that allows PySpark syntax while processing DataFrame
>>>> using either Spark or Pandas depending on the session context.*
>>>>
>>>>
>>>> *Simple, but intuitive example:*
>>>>
>>>> import pyspark.sql.functions as F
>>>>
>>>> def silver(bronze_df):
>>>>     return (
>>>>         bronze_df
>>>>         .withColumnRenamed("bronze_col", "silver_col")
>>>>     )
>>>>
>>>> def gold(silver_df):
>>>>     return (
>>>>         silver_df
>>>>         .withColumnRenamed("silver_col", "gold_col")
>>>>         .withColumn("gold_col", F.col("gold_col") + 1)
>>>>     )
>>>>
>>>> def features(gold_df):
>>>>     return (
>>>>         gold_df
>>>>         .withColumnRenamed("gold_col", "feature_col")
>>>>         .withColumn("feature_col", F.col("feature_col") + 1)
>>>>     )
>>>>
>>>> # With the Spark Session (normal way of using PySpark)
>>>> spark = SparkSession.builder.master("local[1]").getOrCreate()
>>>> bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
>>>> silver_df = silver(bronze_df)
>>>> gold_df = gold(silver_df)
>>>> features_df = features(gold_df)
>>>> features_df.show()
>>>>
>>>> # Proposed "Pandas Spark Session"
>>>> spark = SparkSession.builder.as_pandas.getOrCreate()
>>>> # This would execute the same transformations using Pandas under the hood.
>>>>
>>>> This would enable teams to share the same codebase while choosing the
>>>> most efficient processing engine.
>>>>
>>>> We've built and experimented in different data teams with a public
>>>> library, *flyipe
>>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>*,
>>>> that uses pandas_api() transformations, but can run using Pandas
>>>> <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
>>>> but it still requires ML teams to manage separate pipelines for Spark
>>>> dependencies.
>>>>
>>>> I’d love to hear thoughts from the community on this idea, and *if
>>>> there's a better approach to solving this issue*.
>>>>
>>>> Thanks,
>>>> José Müller
>>>>
>>>
>>
>> --
>> José Müller
>>
>

-- 
José Müller

Re: [PROPOSAL] Unified PySpark-Pandas API to Bridge Data Engineering and ML Workflows

Reply via email to