[ 
https://issues.apache.org/jira/browse/SPARK-9488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732112#comment-14732112
 ] 

Alexey Grishchenko commented on SPARK-9488:
-------------------------------------------

The OrderedDict implementation in Python is internally quite simple: it stores 
the data in a dict and additionally keeps the keys in a list in insertion 
order. I didn't see much difference between a clean reimplementation of Row on 
top of OrderedDict and simply adding a dict to the Row implementation that maps 
each field name to its index in the tuple. I've written the code, calling the 
new implementation IndexedRow, and attached my benchmark results to the ticket.
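To illustrate the idea, here is a minimal sketch of the two approaches (the 
names and details are my own simplification, not the actual patch; the real 
code is linked below). The Row model does a linear scan of the field list on 
every name lookup, while the IndexedRow model pays once at construction to 
build a name-to-position dict:

```python
class Row(tuple):
    """Simplified model of pyspark.sql.types.Row: lookup scans the field list."""
    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # PySpark sorts keyword-argument names
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        return row

    def __getattr__(self, name):
        try:
            idx = self.__fields__.index(name)  # O(n) scan over field names
        except ValueError:
            raise AttributeError(name)
        return self[idx]


class IndexedRow(tuple):
    """Sketch of the benchmarked alternative: an extra dict maps name -> index."""
    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        # Extra cost paid at creation time: build the lookup dict
        row._field_index = {n: i for i, n in enumerate(names)}
        return row

    def __getattr__(self, name):
        try:
            return self[self._field_index[name]]  # O(1) dict lookup
        except KeyError:
            raise AttributeError(name)
```

The trade-off below falls out of this structure directly: the dict makes each 
lookup cheaper but makes each construction more expensive.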

What I can say: on average, creating an IndexedRow is about 2 times slower than 
creating a Row. This of course depends on the data, but for instance with 50 
fields and 1'000'000 objects the IndexedRow implementation spends 56.6 seconds 
more than the Row implementation. The average time to look up a single field 
across those 1'000'000 Row objects with 50 fields is 1.9 seconds; for IndexedRow 
it is 1.5 seconds, so you win 0.4 seconds per field. To cover the 56.6-second 
creation overhead you would need to query 142 fields, which does not make sense 
for a 50-field row.
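The break-even point quoted above can be checked directly from the reported 
timings (figures are from my 50-field / 1'000'000-object run):

```python
import math

# Figures from the 50-field / 1'000'000-object benchmark run reported above
creation_overhead = 56.6   # extra seconds IndexedRow spends at construction
row_lookup = 1.9           # seconds to fetch one field from all Row objects
indexed_lookup = 1.5       # the same fetch with IndexedRow

saving_per_field = row_lookup - indexed_lookup        # ~0.4 s saved per field
break_even = math.ceil(creation_overhead / saving_per_field)
print(break_even)          # 142 field lookups needed before IndexedRow pays off
```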

For 100 fields the situation is similar: 120 seconds of creation overhead for 
IndexedRow against a 1.4-second win per field lookup, which implies that 
IndexedRow would outperform Row only if you queried 72 fields.

On top of this, you should also account for the memory overhead of storing the 
field names twice in each IndexedRow object.

In my opinion, this optimization does not make sense. Feel free to comment if 
you have objections or remarks on my implementation.

Code of the benchmark is available here: 
https://github.com/0x0FFF/experiments/blob/master/spark/python/benchmark_SPARK_9488.py

> pyspark.sql.types.Row very slow when using named arguments
> ----------------------------------------------------------
>
>                 Key: SPARK-9488
>                 URL: https://issues.apache.org/jira/browse/SPARK-9488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0
>         Environment: 
>            Reporter: Alexis Benoist
>              Labels: performance
>
> We can see that the implementation of the Row is accessing items in O(n).
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217
> We could use an OrderedDict instead of a tuple to make the access time in 
> O(1). Can the keys be of an unhashable type?
> I'm ok to do the edit.
> Cheers,
> Alexis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
