HyukjinKwon commented on issue #26809: [SPARK-30185][SQL] Implement 
Dataset.tail API
URL: https://github.com/apache/spark/pull/26809#issuecomment-563503647
 
 
   > How much is this different from sorting in reverse and head()? in 
comparison this looks like it has to traverse the whole data set?
   
   At least it can drop records on the executor side, and it doesn't require a 
sort.
   
   > once the shuffle is involved, without ordering there should be no 
outstanding difference with head() as we don't guarantee ordering anyway, and 
with ordering the semantic would be same as sort with reverse order + head().
   
   Yes, I think this is a good point. With ordering, it can be just a different 
way of doing the same thing. Without ordering, it's designed to follow the 
natural order, which is not guaranteed in many cases in Spark.
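
To make the comparison above concrete, here is a minimal sketch in plain Scala collections. It is only a model, not the implementation in this PR: `TailSketch`, `Partitions`, `tail`, and `sortReverseHead` are hypothetical names, and a dataset is assumed to be a sequence of partitions held in their natural order.

```scala
object TailSketch {
  // Hypothetical model: a dataset is a sequence of partitions in natural order.
  type Partitions[A] = Seq[Seq[A]]

  // Each "executor" keeps only the last n records of its own partition, so at
  // most n records per partition reach the "driver". No sort is involved.
  def tail[A](parts: Partitions[A], n: Int): Seq[A] = {
    val trimmed = parts.map(_.takeRight(n)) // executor side: drop early records
    trimmed.flatten.takeRight(n)            // driver side: keep the last n overall
  }

  // The alternative raised in the question: sort in reverse order, then head.
  def sortReverseHead[A: Ordering](parts: Partitions[A], n: Int): Seq[A] =
    parts.flatten.sorted(Ordering[A].reverse).take(n)
}
```

In this model, `TailSketch.tail(Seq(Seq(3, 1), Seq(4, 1, 5), Seq(9, 2, 6)), 2)` gives `Seq(2, 6)`, the last two records in natural order, while `sortReverseHead` on the same input gives `Seq(9, 6)`. When the input is already globally sorted, reversing the result of `tail` matches `sortReverseHead`, which is the "same thing with ordering" case.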
   
   One clear use case might be reading from an external data source. If I am 
not wrong, when we use a Hadoop RDD, its natural order is respected. So the 
`spark.read.format("xml").load().tail(5)` case will work.
   Another case is a local collection. If I am not wrong, the natural order is 
preserved.
   I am sure there are more such cases that I should identify.
   
   FWIW, Spark used to (unofficially) respect the natural order, but that broke 
after we started to consolidate small partitions into a big partition, IIRC.

