zero323 commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-551019678

> I agree that an update to `Row` is required and I'm happy to see the discussions going in the right direction. However, this will be part of Spark 3.x, and for the many Spark users running it in production it will be a long wait until that happens.
>
> Do you think it makes sense to apply this change for Spark 2.x?

I don't have a strong opinion (in particular, I cannot speak for project policy, as I am usually less conservative than most of the folks around here), but I'd say that, conditioned on the lack of a significant negative performance impact on the more common use cases (i.e. `RDD[Row]` without a schema), it is fine. I am also not completely against incorporating this in 3.0, I'd just prefer to see how the numbers look first.

> > If not, I'd like to see some memory profiling data (especially memory - timings might actually be better for now, as we skip all the nasty `obj[n]`, but that's not very meaningful*) first.
>
> I can try to do that. Do you have any example of how to do proper timings with Spark?

For such a simple case, `timeit` / `memory_profiler` on `map` objects should be more than sufficient, unless you observe something truly unexpected (see the sketch at the end of this comment).

- Time-wise, performance should for now be better than the existing approach, so unless you see an unexpected slowdown (particularly with wide schemas) we should be good.
- Memory-wise, we might expect a higher peak memory usage (let's say up to 100%, excluding constant interpreter overhead), as the temporary `dicts` should be swept away once they go out of scope.

> This was a workaround introduced by #14469. `tuple(obj)` relies on the order of the fields - which for `Row` is alphabetical. If this doesn't correspond to the schema, the order of the fields will be messed up.
>
> My change here is actually just doing the same workaround for schemas with fields that need some serialization.

My bad, makes sense.

> I'm unclear if the proposed change will have any negative effect. The code path should only be taken in cases that either failed (as in [SPARK-24915](https://issues.apache.org/jira/browse/SPARK-24915)) or might have silently mixed up the `.toInternal()` calls of different types.
>
> Even with sub-optimal performance, this would only improve the situation for users.
>
> Would you prefer to replace
>
> https://github.com/qudade/spark/blob/a52de2e4b258e7fecad4143e00f01df4b096a513/python/pyspark/sql/types.py#L603
>
> ```
> return self.toInternal(obj.asDict())
> ```
>
> with
>
> ```
> return tuple(f.toInternal(obj[n]) if c else obj[n]
>              for n, f, c in zip(self.names, self.fields, self._needConversion))
> ```
>
> to reduce one level of indirection?

No. As much as I don't like dictionaries here, there is a much better solution than that.
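
For reference, a minimal sketch of the kind of micro-benchmark I have in mind. It assumes a local PySpark 2.x installation; the schema, field names and row count are made up purely for illustration. It times the existing `StructType.toInternal` path on `Row` objects against the zip-based conversion from the diff, inlined by hand, and notes how `memory_profiler` can wrap the same callables for peak memory.

```python
# A minimal sketch, assuming a local PySpark 2.x installation; the schema,
# field names and row count below are made up purely for illustration.
import timeit
from datetime import datetime

from pyspark.sql.types import Row, StringType, StructField, StructType, TimestampType

# TimestampType needs conversion, so StructType.toInternal() takes the
# "needs serialization" code path discussed above.
schema = StructType([
    StructField("name", StringType()),
    StructField("ts", TimestampType()),
])

# Row created with kwargs sorts its fields alphabetically ("name", "ts"),
# which here happens to match the schema order.
rows = [Row(name="user_%d" % i, ts=datetime(2019, 1, 1)) for i in range(10000)]


def current_path():
    # Existing behaviour: let StructType.toInternal() handle the Row.
    return [schema.toInternal(r) for r in rows]


def proposed_path():
    # The zip-based conversion from the diff, inlined for comparison;
    # names, fields and _needConversion are the StructType attributes
    # referenced in pyspark/sql/types.py.
    names, fields, conv = schema.names, schema.fields, schema._needConversion
    return [
        tuple(f.toInternal(r[n]) if c else r[n]
              for n, f, c in zip(names, fields, conv))
        for r in rows
    ]


if __name__ == "__main__":
    print("current :", min(timeit.repeat(current_path, number=10, repeat=3)))
    print("proposed:", min(timeit.repeat(proposed_path, number=10, repeat=3)))
    # For peak memory, memory_profiler can wrap the same callables, e.g.
    #   from memory_profiler import memory_usage
    #   memory_usage((current_path, (), {}), max_usage=True)
```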
