zero323 commented on issue #26118: [SPARK-24915][Python] Fix Row handling with 
Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-551019678
 
 
   > I agree that an update to `Row` is required and I'm happy to see 
discussions going in the
   > right direction. However, this will be part of Spark 3.x, and for many 
Spark users running it in production it will be a long wait until this happens.
   > 
   > Do you think it makes sense to apply this change for Spark 2.x?
   
   I don't have a strong opinion (especially since I cannot speak for project 
policy, as I am usually less conservative than most of the folks around here), 
but I'd say that, conditioned on the lack of a significant negative performance 
impact on the more common use cases (i.e. `RDD[Row]` without schema), it is fine.
   
   I am also not completely against incorporating this in 3.0; I'd just prefer to 
see how things develop first.
   
   > > If not, I'd like to see some memory profiling data (especially memory - 
timings might actually be better for now, as we skip all the nasty `obj[n]`, 
but that's not very meaningful*) first.
   > 
   > I can try to do that. Do you have an example of how to do proper timings 
with Spark?
   
   For such a simple case `timeit` / `memory_profiler` on `map` objects should be 
more than sufficient, unless you observe something truly unexpected (a rough 
sketch follows the list below).
   
   - Time-wise, performance should for now be better than the existing approach, 
so unless you see an unexpected slowdown (particularly with wide schemas) we 
should be good.
   - Memory-wise, we might expect higher peak memory usage (let's say up to 
100%, excluding constant interpreter overhead), as the temporary `dicts` should 
be swept away once they go out of scope.
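   
   Something along these lines should do as a rough sketch (the schema and data 
below are made up, and `memory_profiler` is a separate third-party package):
   
   ```python
   import datetime
   import timeit

   from pyspark.sql import Row
   from pyspark.sql.types import (
       IntegerType, StringType, StructField, StructType, TimestampType)

   # Toy schema / data; field names are chosen so that the alphabetical
   # ordering of keyword-constructed Rows matches the schema order.
   schema = StructType([
       StructField("age", IntegerType()),
       StructField("name", StringType()),
       StructField("ts", TimestampType()),
   ])
   rows = [Row(age=i, name="a", ts=datetime.datetime(2019, 1, 1))
           for i in range(10000)]

   # Wall-clock time of the conversion path under test; run this on both the
   # patched and unpatched checkout and compare.
   print(timeit.timeit(lambda: list(map(schema.toInternal, rows)), number=10))

   # Peak memory of the same call (memory_profiler needs to be installed):
   # from memory_profiler import memory_usage
   # print(max(memory_usage((lambda: list(map(schema.toInternal, rows)), (), {}))))
   ```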
    
   
   
   > This was a workaround introduced by #14469. `tuple(obj)` relies on the 
order of fields - which for `Row` is alphabetical. If this doesn't correspond 
to the schema, the order of the fields will be messed up.
   > 
   > My change here is actually just doing the same workaround for schemas with 
fields that need some serialization.
   > 
   
   My bad, makes sense.
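   
   To make the ordering issue concrete, here is a minimal sketch (it assumes the 
current, pre-3.0 behaviour where a keyword-constructed `Row` sorts its fields 
alphabetically; the field names are purely illustrative):
   
   ```python
   import datetime

   from pyspark.sql import Row
   from pyspark.sql.types import StringType, StructField, StructType, TimestampType

   # A keyword-constructed Row sorts its fields alphabetically: ("created", "zone").
   row = Row(zone="UTC", created=datetime.datetime(2019, 1, 1))

   # The schema declares the fields in a different order: ("zone", "created").
   schema = StructType([
       StructField("zone", StringType()),
       StructField("created", TimestampType()),
   ])

   # tuple(row) is (datetime(...), "UTC"); zipping it positionally against
   # schema.fields would hand the datetime to StringType and "UTC" to
   # TimestampType - the mix-up the asDict() workaround avoids.
   print(tuple(row))
   print([f.name for f in schema.fields])
   ```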
   
   > I'm unclear if the proposed change will have any negative effect. The 
codepath should only be taken in cases that either failed (as in 
[SPARK-24915](https://issues.apache.org/jira/browse/SPARK-24915)) or might have 
silently mixed up the `.toInternal()` calls of different types.
   > 
   > Even with sub-optimal performance, this would only improve the situation 
for users.
   > 
   > Would you prefer to replace
   > 
   > 
https://github.com/qudade/spark/blob/a52de2e4b258e7fecad4143e00f01df4b096a513/python/pyspark/sql/types.py#L603
   > 
   > ```
   > return self.toInternal(obj.asDict())
   > ```
   > 
   > by
   > 
   > ```
   > return tuple(f.toInternal(obj[n]) if c else obj[n]
   >              for n, f, c in zip(self.names, self.fields, self._needConversion))
   > ```
   > 
   > to reduce one indirection?
   
   No. As much as I don't like dictionaries here, there isn't a much better 
solution here.
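   
   Just to spell out the trade-off for anyone reading along, a standalone toy 
sketch (the names below are made-up stand-ins, not the actual `StructType` 
internals): the dict route matches values by field name at the cost of a 
temporary dict per row, while the direct comprehension drops that indirection 
but depends on positional order.
   
   ```python
   import datetime

   # Made-up stand-ins for StructType internals: field names, per-field
   # converters, and a per-field "needs conversion" flag.
   names = ["name", "ts"]
   converters = [None, lambda v: int(v.timestamp() * 1000000)]
   need_conversion = [False, True]

   def via_dict(row_dict):
       # Dict route (what asDict() enables): match values by field name,
       # order-independent, at the cost of a temporary dict per row.
       return tuple(c(row_dict[n]) if nc else row_dict[n]
                    for n, c, nc in zip(names, converters, need_conversion))

   def direct(values):
       # Direct route (one indirection less): relies on `values` being in the
       # same positional order as `names`.
       return tuple(c(v) if nc else v
                    for v, c, nc in zip(values, converters, need_conversion))

   ts = datetime.datetime(2019, 1, 1)
   assert via_dict({"ts": ts, "name": "a"}) == direct(("a", ts))
   ```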
   
   
