Re：Breaking the previous large-scale sort record with Spark

欧阳晋(欧阳晋) Mon, 13 Oct 2014 06:11:35 -0700

Great News!

Still some questions for this Sort benchmark
1 How many Map tasks run in the sort? I think the 2,50,000 is the # of reduce 
tasks. (if the unsorted file is read from HDFS and use the default 64MB chunk 
size, the # of map task will be about 1PB/64MB?)2 How long a single Reduce task 
run ? I think becuase the #of reduce task is very large , a single reduce task 
will not last a long time, so the record the reduce read in will also be very 
small?3 How much memory a single Reduce task use in one run? 
------------------------------------------------------------------
发件人：Matei Zaharia <[email protected]>
发送时间：2014年10月10日(星期五) 22:54
收件人：user <[email protected]>; dev <[email protected]>
主　题：Breaking the previous large-scale sort record with Spark


Hi folks,

I interrupt your regularly scheduled user / dev list to bring you some pretty 
cool news for the project, which is that we've been able to use Spark to break 
MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer 
nodes. There's a detailed writeup at 
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by 
sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
nodes; and we also scaled up to sort 1 PB in 234 minutes.

I want to thank Reynold Xin for leading this effort over the past few weeks, 
along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
addition, we'd really like to thank Amazon's EC2 team for providing the 
machines to make this possible. Finally, this result would of course not be 
possible without the many many other contributions, testing and feature 
requests from throughout the community.

For an engine to scale from these multi-hour petabyte batch jobs down to 
100-millisecond streaming and interactive queries is quite uncommon, and it's 
thanks to all of you folks that we are able to make this happen.

Matei
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re：Breaking the previous large-scale sort record with Spark

Reply via email to