datapythonista commented on PR #750: URL: https://github.com/apache/datafusion-python/pull/750#issuecomment-2220954392
Thanks for the comments, and sorry if my feedback is not helpful. Just one last comment if you don't mind.

Since the final goal seems to be adoption, and making things easier for Python users, the question that comes to my mind is whether this API wants to be a building block for other projects, or a reasonable project for end users. My previous feedback was based on DataFusion being more for developers than for end users, with, for example, nice DataFrame APIs for DataFusion being built as separate projects.

If the idea is to make this API reasonable for end users, I think the approach here makes more sense (not sure if I'd wrap everything, but some classes surely would need it). For me, the main things that would make DataFusion as usable as other DataFrame libraries are summarized in this example:

```python
import datafusion
from datafusion import col, lit, functions as f
import pyarrow

# something like this would be implemented internally, so users can call `datafusion.read_*`
def _read_parquet(*args, **kwargs):
    ctx = datafusion.SessionContext()
    return ctx.read_parquet(*args, **kwargs)

# creating an alias of `read_*` functions so users don't need to know about
# `SessionContext` when the defaults are fine
datafusion.read_parquet = _read_parquet

df = (
    datafusion.read_parquet("buildings.parquet")
    # `.filter()` accepting multiple conditions (which will be an AND) instead of
    # having to use `&` with its operator precedence problems
    .filter(
        col("is_offplan") == False,
        # `.lit(2)` not being required, and Python literals working with operators
        col("rooms") >= 2,
    )
    .aggregate(
        [col("area_name_en")],
        # `.cast()` accepting Python types, which would be internally converted
        # to the PyArrow equivalent
        [f.mean(col("has_parking").cast(float))],
    )
    .select(
        col("area_name_en").alias("Area"),
        # removing the default `?table?` in column names, the column name was
        # "AVG(?table?.has_parking)"
        col("AVG(has_parking)").alias("Percentage of buildings with parking"),
    )
)
```

Implementing those things and similar ones would be worth the wrappers. I personally don't think it's worth it for users to get an idea of the project structure by browsing the source code (when docs can be provided, and do a better job at that).
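As a rough illustration of the multi-condition `.filter()` idea in the example above, a wrapper could fold its arguments into a single predicate with `&` before handing it to the existing single-argument filter. The sketch below is only an assumption about how that might look: `combine_and` is a hypothetical helper, not part of DataFusion, and it is demonstrated on plain Python sets (which also implement `&`) so that it runs without DataFusion installed.

```python
from functools import reduce
import operator


def combine_and(*predicates):
    """Fold several predicates into one by applying `&` left to right.

    Hypothetical helper: it works for any objects implementing `__and__`,
    which is assumed here to include DataFusion expressions.
    """
    if not predicates:
        raise ValueError("at least one predicate is required")
    return reduce(operator.and_, predicates)


# Stand-in demonstration with sets, where `&` means intersection.
# No DataFusion objects are involved in this example.
combined = combine_and({1, 2, 3}, {2, 3, 4}, {3, 4, 5})
print(combined)
```

Because the helper builds the `&` chain itself, callers pass each condition as a separate argument and never have to parenthesize comparisons to work around operator precedence. A wrapper's `filter(*predicates)` could then delegate to something like `df.filter(combine_and(*predicates))`, again as an assumption rather than an existing API.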