datapythonista commented on PR #750:
URL: https://github.com/apache/datafusion-python/pull/750#issuecomment-2220954392

   Thanks for the comments, and sorry if my feedback is not helpful. Just one 
last comment if you don't mind.
   
   Since the final goal seems to be adoption and making things easier for 
Python users, the question that comes to my mind is whether this API is meant 
to be a building block for other projects or a reasonable project for end 
users.
   
   My previous feedback assumed that DataFusion is more for developers than 
for end users, with, for example, nice DataFrame APIs for DataFusion being 
built as separate projects. If the idea is to make this API reasonable for 
end users, the approach here makes more sense to me (I'm not sure I'd wrap 
everything, but some classes surely would need it).
   
   For me, the main things that would make DataFusion as usable as other 
DataFrame libraries are summarized in this example:
   
   ```python
   import datafusion
   from datafusion import col, lit, functions as f
   import pyarrow


   # Something like this would be implemented internally, so users can call
   # `datafusion.read_*`.
   def _read_parquet(*args, **kwargs):
       ctx = datafusion.SessionContext()
       return ctx.read_parquet(*args, **kwargs)


   # Creating an alias of the `read_*` functions so users don't need to know
   # about `SessionContext` when the defaults are fine.
   datafusion.read_parquet = _read_parquet


   df = (
       datafusion.read_parquet("buildings.parquet")
       # `.filter()` accepting multiple conditions (which will be an AND)
       # instead of having to use `&` with its operator precedence problems.
       .filter(
           col("is_offplan") == False,
           # `lit(2)` not being required, and Python literals working with
           # operators.
           col("rooms") >= 2,
       )
       .aggregate(
           [col("area_name_en")],
           # `.cast()` accepting Python types, which would be internally
           # converted to the PyArrow equivalent.
           [f.mean(col("has_parking").cast(float))],
       )
       .select(
           col("area_name_en").alias("Area"),
           # Removing the default `?table?` in column names; the column name
           # was "AVG(?table?.has_parking)".
           col("AVG(has_parking)").alias("Percentage of buildings with parking"),
       )
   )
   ```
   
   Implementing those things and similar ones would be worth the wrappers. I 
personally don't think it's worth it for users to learn the project structure 
by browsing the source code, when docs can be provided and do a better job at 
that.
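
To make the `.filter()` and literal points above concrete, here is a minimal sketch of how a wrapper could AND several conditions together and wrap plain Python literals automatically. The `Expr` class, `col`, `to_expr`, and `combine_conditions` below are hypothetical stand-ins written only for illustration; they are not DataFusion's classes, and the real wrapper would build DataFusion expressions rather than strings.

```python
import operator
from functools import reduce
from typing import Any


class Expr:
    """Hypothetical stand-in for an expression node (not DataFusion's Expr)."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def _binary(self, op: str, other: Any) -> "Expr":
        # Plain Python values are wrapped here, so no explicit `lit()` is needed.
        return Expr(f"({self.sql} {op} {to_expr(other).sql})")

    def __ge__(self, other: Any) -> "Expr":
        return self._binary(">=", other)

    def __eq__(self, other: Any) -> "Expr":  # type: ignore[override]
        return self._binary("=", other)

    def __and__(self, other: Any) -> "Expr":
        return self._binary("AND", other)


def to_expr(value: Any) -> Expr:
    """Return expressions unchanged and turn Python literals into literal nodes."""
    if isinstance(value, Expr):
        return value
    if isinstance(value, bool):
        return Expr(repr(value).lower())
    return Expr(repr(value))


def col(name: str) -> Expr:
    """Hypothetical column reference."""
    return Expr(name)


def combine_conditions(*conditions: Expr) -> Expr:
    """AND all conditions together, as a multi-argument `.filter()` might."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return reduce(operator.and_, conditions)


predicate = combine_conditions(col("is_offplan") == False, col("rooms") >= 2)
print(predicate.sql)  # prints ((is_offplan = false) AND (rooms >= 2))
```

Because each condition is passed as a separate argument, every comparison is fully evaluated before the AND is applied, so the caller never has to parenthesize around `&`.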


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

