Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Well, technically yes. But I would say it's a little more than that.
The current approach to deserialization of `Seq`s is to copy the data into
an array, construct a `WrappedArray` (which extends `Seq`) and optionally copy
the data (again) into the new collection. This needs to go through all the
elements twice if anything other than a `Seq` (or `WrappedArray` directly) is
requested.
This PR takes a more straightforward approach and constructs a mutable
collection builder (which is defined for every Scala collection), adds the
elements to it and retrieves the result.
I assumed this would be a performance improvement and am quite surprised
that there is no difference. But I think this might be due to the improvement
being so small as to drown in the usual operation overhead. Sadly, I do not
have the resources to measure the operation on larger amounts of data, having
the benchmarks fail on larger collections on my setup.
I am also not that familiar with Spark's internals to determine whether
@kiszk's suggestion would improve operations in other ways, eg. during
operations in the cluster environment. That is why I decided to implement it
but keep it separate from my proposed changes for the time being.
As to other benefits that this PR would bring other than peformance
improvements:
* I would like to implement Java `List`s support in a manner similar to
what I did with `Map`s in #16986 (I would also ask you to take a look at that
when you have the time) by just slightly altering the code.
* I didn't know this when I implemented this but Scala 2.13 will introduce
a major [collections
rework](https://www.scala-lang.org/blog/2017/02/28/collections-rework.html)
that will change the way the `to` method (used for conversions in the current
solution) works. This will require a rewrite whereas I believe the method used
here will remain largely the same.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]