Github user leifwalsh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15821#discussion_r122359928
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -1648,8 +1650,30 @@ def toPandas(self):
             0    2  Alice
             1    5    Bob
             """
    -        import pandas as pd
    -        return pd.DataFrame.from_records(self.collect(), columns=self.columns)
    +        if self.sql_ctx.getConf("spark.sql.execution.arrow.enable", "false").lower() == "true":
    +            try:
    +                import pyarrow
    +                tables = self._collectAsArrow()
    +                table = pyarrow.concat_tables(tables)
    --- End diff ---
    
    If `tables` is an empty list (e.g. if you load a dataset, filter out every row, and produce zero rows), `pyarrow.concat_tables` raises an exception rather than producing an empty table. This should probably be fixed in Arrow (cc @wesm), but we should be defensive here: try to produce a `DataFrame` with the right schema but no rows, if possible.
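A minimal sketch of the guard being suggested (the helper name `_arrow_tables_to_pandas` and its `columns` parameter are hypothetical, not part of the PR; the empty-list branch falls back to pandas so pyarrow is only touched when there are tables to concatenate):

```python
import pandas as pd

def _arrow_tables_to_pandas(tables, columns):
    # pyarrow.concat_tables raises on an empty list, so guard first and
    # fall back to an empty DataFrame that still carries the column names,
    # preserving the schema even when zero rows were collected.
    if not tables:
        return pd.DataFrame(columns=columns)
    import pyarrow
    return pyarrow.concat_tables(tables).to_pandas()

# Zero-row case: no Arrow tables, but the schema survives.
empty = _arrow_tables_to_pandas([], ["age", "name"])
```

This keeps `toPandas()` total over empty results instead of leaking a pyarrow exception to the caller.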

