There's also PolarSpark, which offers a PySpark API on top of the Polars
engine: https://github.com/khalidmammadov/polarspark
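
A rough usage sketch (assuming the package mirrors the pyspark.sql module
layout as a drop-in replacement; the import paths below are an assumption,
not taken from the repo):

    # Assumed drop-in imports: polarspark mirroring pyspark.sql (unverified)
    from polarspark.sql import SparkSession
    from polarspark.sql import functions as F

    # Familiar PySpark-style code, executed by Polars on a single node
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(data=[(1,)], schema=["bronze_col"])
    df = df.withColumn("silver_col", F.col("bronze_col") + 1)
    df.show()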
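
And for comparison, the DuckDB Spark API mentioned in the quoted thread
below looks roughly like this (adapted from the linked DuckDB docs; it
currently lives under an experimental module, so the import path may
change):

    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit
    import pandas as pd

    # PySpark-style calls, executed by DuckDB on a single node
    spark = SparkSession.builder.getOrCreate()
    pdf = pd.DataFrame({"age": [34, 45], "name": ["Joan", "Peter"]})
    df = spark.createDataFrame(pdf).withColumn("location", lit("Seattle"))
    print(df.select(col("age"), col("location")).collect())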
PS. PolarSpark is also on PyPI.

On Wed, 12 Feb 2025, 19:56 Sem, <ssinche...@apache.org> wrote:

> DuckDB provides a "PySpark syntax" on top of a fast single-node engine:
>
> https://duckdb.org/docs/clients/python/spark_api.html
>
> As I remember, DuckDB is much faster than pandas on a single node, and it
> already provides a Spark-compatible API.
>
> On 2/10/25 1:02 PM, José Müller wrote:
> >
> > Hi all,
> >
> > I'm new to the Spark community, so please let me know if this isn't
> > the right forum for feature proposals.
> >
> > *About Me:*
> > I've spent over 10 years in data roles, from engineering to machine
> > learning and analytics. A recurring challenge I've encountered is the
> > disconnect between data engineering and ML teams when it comes to
> > feature engineering and data transformations using Spark.
> >
> > *The Problem:*
> >
> > There is a growing disconnect between Data Engineering and Machine
> > Learning teams due to differences in the tools and languages they use
> > for data transformations.
> >
> >   * *Tooling Divergence:*
> >        o *Data Engineering Teams* primarily use *Spark SQL* and
> >          *PySpark* for building scalable data pipelines and
> >          transforming data across the different layers (bronze,
> >          silver, gold).
> >        o *Machine Learning Teams* typically rely on *Pandas* for
> >          feature engineering, model training, and recreating features
> >          on the fly in production environments.
> >
> >   * *Challenges Arising from This Divergence:*
> >
> >        1. *Code Duplication & Maintenance Overhead:*
> >             o ML teams often need to rebuild feature pipelines from
> >               raw data or adapt data engineering outputs into Pandas,
> >               leading to duplicated effort and maintenance in two
> >               different codebases.
> >             o This forces teams to maintain expertise in multiple
> >               frameworks, increasing complexity and reducing
> >               cross-team supportability.
> >        2. *Inconsistent Feature Definitions:*
> >             o Transformations are implemented in different languages
> >               (PySpark for DE, Pandas for ML), which causes *feature
> >               drift*: the same metrics or features calculated in
> >               separate environments diverge over time, leading to
> >               inconsistent results.
> >        3. *Performance Bottlenecks in Production:*
> >             o While PySpark can regenerate features for serving
> >               models via APIs, it introduces inefficiencies:
> >                 + *Overhead for Small Payloads:* Spark's distributed
> >                   processing is unnecessary for small, real-time
> >                   requests and is slower than Pandas.
> >                 + *Latency Issues:* Spark struggles to deliver the
> >                   millisecond-level responses required by production
> >                   APIs.
> >                 + *Operational Complexity:* Maintaining Spark within
> >                   API environments adds unnecessary overhead.
> >        4. *Data Refinement Gap:*
> >             o Since ML teams are often formed after data engineering
> >               teams, they lag behind in leveraging years of
> >               PySpark-based data refinements. Reproducing the same
> >               transformations in Pandas to match the DE pipelines is
> >               time-consuming and error-prone.
> >
> > *The Ideal Scenario*
> >
> > To bridge the gap between Data Engineering and Machine Learning
> > teams, the ideal solution would:
> >
> >   1. *Unify the Tech Stack:*
> >        * Both Data Engineers and ML teams would use *PySpark syntax*
> >          for data transformations and feature engineering. This
> >          shared language would simplify code maintenance, improve
> >          collaboration, and reduce the need for cross-training in
> >          multiple frameworks.
> >   2. *Flexible Execution Backend:*
> >        * While using the same PySpark code, teams could choose the
> >          most efficient execution engine based on their needs:
> >            o *Data Engineers* would continue leveraging Spark's
> >              distributed processing for large-scale data
> >              transformations.
> >            o *ML Teams* could run the same PySpark transformations
> >              using *Pandas* as the processing engine for faster,
> >              on-the-fly feature generation in model training and API
> >              serving.
> >
> > This unified approach would eliminate redundant codebases, ensure
> > consistent feature definitions, and optimize performance across both
> > batch and real-time workflows.
> >
> > *The Proposal:*
> >
> > *Introduce an API that allows PySpark syntax while processing the
> > DataFrame using either Spark or Pandas, depending on the session
> > context.*
> >
> > *Simple, but intuitive example:*
> >
> > from pyspark.sql import SparkSession
> > import pyspark.sql.functions as F
> >
> > def silver(bronze_df):
> >     return (
> >         bronze_df
> >         .withColumnRenamed("bronze_col", "silver_col")
> >     )
> >
> > def gold(silver_df):
> >     return (
> >         silver_df
> >         .withColumnRenamed("silver_col", "gold_col")
> >         .withColumn("gold_col", F.col("gold_col") + 1)
> >     )
> >
> > def features(gold_df):
> >     return (
> >         gold_df
> >         .withColumnRenamed("gold_col", "feature_col")
> >         .withColumn("feature_col", F.col("feature_col") + 1)
> >     )
> >
> > # With the Spark Session (normal way of using PySpark)
> > spark = SparkSession.builder.master("local[1]").getOrCreate()
> >
> > bronze_df = spark.createDataFrame(schema=("bronze_col",), data=[(1,)])
> > silver_df = silver(bronze_df)
> > gold_df = gold(silver_df)
> > features_df = features(gold_df)
> > features_df.show()
> >
> > # Proposed "Pandas Spark Session"
> > spark = SparkSession.builder.as_pandas.getOrCreate()
> > # This would execute the same transformations using Pandas under the
> > # hood.
> >
> > This would enable teams to share the same codebase while choosing the
> > most efficient processing engine.
> >
> > In different data teams, we've built and experimented with a public
> > library, *flypipe*
> > <https://flypipe.github.io/flypipe/html/release/4.1.0/index.html#what-flypipe-aims-to-facilitate>,
> > that uses pandas_api() transformations and can run them using Pandas
> > <https://flypipe.github.io/flypipe/html/release/4.1.0/notebooks/tutorial/multiple-node-types.html#4.-pandas_on_spark-nodes-as-pandas>,
> > but it still requires ML teams to manage separate pipelines for the
> > Spark dependencies.
> >
> > I'd love to hear thoughts from the community on this idea, and *if
> > there's a better approach to solving this issue*.
> >
> > Thanks,
> > José Müller
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
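
FWIW, the pandas_api() route that the quoted proposal mentions (via
flypipe) already ships with PySpark: a Spark DataFrame can be switched
into the pandas-on-Spark API and back, or materialised as plain pandas.
A minimal sketch using only stock PySpark (it still needs a Spark
session, which is exactly the overhead the proposal wants to avoid for
low-latency serving):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    sdf = spark.createDataFrame(data=[(1,)], schema=("bronze_col",))

    # pandas-style syntax, still executed by Spark
    psdf = sdf.pandas_api()
    psdf["feature_col"] = psdf["bronze_col"] + 1

    # back to a Spark DataFrame for the DE pipeline...
    psdf.to_spark().show()
    # ...or out to plain pandas for model/serving code
    print(psdf.to_pandas())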