Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19459
Benchmarks for running in local mode: 16 GB memory, i7-4800MQ CPU @ 2.70GHz × 8 cores, using the default Spark configuration. The data is 10 columns of doubles with 100,000 rows.
Code:
```python
import pandas as pd
import numpy as np

# Baseline: Arrow optimization disabled
spark.conf.set("spark.sql.execution.arrow.enable", "false")
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))
%timeit spark.createDataFrame(pdf)

# Same conversion with the Arrow optimization enabled
spark.conf.set("spark.sql.execution.arrow.enable", "true")
%timeit spark.createDataFrame(pdf)
```
Without Arrow: 1 loop, best of 3: 7.21 s per loop
With Arrow: 10 loops, best of 3: 30.6 ms per loop
**Speedup of ~235x** (7.21 s / 30.6 ms ≈ 236)
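For environments without IPython's `%timeit`, a plain-Python variant of the same timing might look like the sketch below. The `local[*]` master, run count, and `best_time` helper are assumptions for illustration, and the Arrow path requires pyarrow to be installed. (The config key matches the one used here; later Spark releases renamed it to `spark.sql.execution.arrow.enabled`.)

```python
import time

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
pdf = pd.DataFrame(np.random.rand(100000, 10), columns=list("abcdefghij"))

def best_time(n_runs=3):
    # Best wall-clock time over n_runs, mirroring %timeit's "best of" report
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        spark.createDataFrame(pdf)
        times.append(time.perf_counter() - start)
    return min(times)

for enabled in ("false", "true"):
    spark.conf.set("spark.sql.execution.arrow.enable", enabled)
    print("arrow.enable=%s: best of %d: %.4f s" % (enabled, 3, best_time()))
```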
Also tested creating up to 2 million rows with Arrow; the results scale linearly (see the scaling sketch below).
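A rough sketch of how the linear-scaling check could be reproduced; the specific row counts and single-run timing are illustrative assumptions, not the exact procedure used:

```python
import time

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enable", "true")

# Time createDataFrame with Arrow enabled at increasing row counts;
# roughly linear growth in elapsed time indicates linear scaling.
for rows in (100000, 500000, 1000000, 2000000):
    pdf = pd.DataFrame(np.random.rand(rows, 10), columns=list("abcdefghij"))
    start = time.perf_counter()
    spark.createDataFrame(pdf)
    print("%7d rows: %.3f s" % (rows, time.perf_counter() - start))
```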