Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/21546
### Performance Tests - createDataFrame
Tests were run on a 4-node standalone cluster with 32 cores total,
Ubuntu 14.04.1 and OpenJDK 8.
Measured the wall clock time to execute `createDataFrame()` and get the first
record. Took the average of the best times from 5 runs of 5 loops each.
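Test code:
```python
import gc
import time

import numpy as np
import pandas as pd

# Assumes `spark` is an existing SparkSession (e.g. from the pyspark shell).

def run():
    # Build a 10M x 10 pandas DataFrame of random doubles, convert it, fetch the first row.
    pdf = pd.DataFrame(np.random.rand(10000000, 10))
    spark.createDataFrame(pdf).first()

for i in range(6):
    start = time.time()
    run()
    elapsed = time.time() - start
    gc.collect()
    print("Run %d: %f" % (i, elapsed))
```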
| Current Master (s) | This PR (s) |
|--------------------|-------------|
| 6.234608 | 5.665641 |
| 6.32144 | 5.3475 |
| 6.527859 | 5.370803 |
| 6.95089 | 5.479151 |
| 6.235046 | 5.529167 |

| Avg Master (s) | Avg This PR (s) |
|----------------|-----------------|
| 6.4539686 | 5.4784524 |
#### Speedup of 1.178064192
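For reference, the averages and the speedup can be reproduced from the per-run timings. This is only a minimal sketch of the arithmetic: the variable names are chosen here for illustration, and it assumes the speedup is the ratio of the two averages (master over this PR), which matches the reported figure.

```python
# Per-run wall clock times in seconds, copied from the results table.
master = [6.234608, 6.32144, 6.527859, 6.95089, 6.235046]
this_pr = [5.665641, 5.3475, 5.370803, 5.479151, 5.529167]

# Arithmetic mean of each column.
avg_master = sum(master) / len(master)
avg_this_pr = sum(this_pr) / len(this_pr)

# Assumption: speedup = average master time / average PR time.
speedup = avg_master / avg_this_pr

print("Avg master:  %f" % avg_master)
print("Avg this PR: %f" % avg_this_pr)
print("Speedup:     %f" % speedup)
```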
---