Re: Why sort (was Microprocessor Optimization Primer)

Andrew Rowley Sun, 03 Apr 2016 23:50:26 -0700

On 4/04/2016 11:25, David Betten wrote:

First the idea of loading all the data into a large hashmap to do the sort
tends to eliminate one very important thing and that's overlap.
Essentially, you read the entire input, conduct your massive hashsort, and
then write the output with no overlap of those three phases.  The approach
I prefer is an iterative process of sorting smaller amounts and writing
them to work files (either on disk or in memory) and then at end of input,
you almost immediately begin the output process of merging those sorted
strings.  This technique is very efficient and I can tell you many z/OS
customers are sorting tens to hundreds of gigabytes of data this way.

I wasn't actually suggesting sorting using a Hashmap, or that Java sort was more efficient than DFSORT (although the overhead of transferring data between Java<->DFSORT might make Java sort preferable when the data is already in Java).

I was more wondering whether collection classes like Hashmap could avoid the need to sort the data altogether, at which point the efficiency becomes moot. One common example given for sorting of data is to do grouping and totals, which can easily be implemented using a Hashmap with unordered data.
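To make the grouping-and-totals case concrete, here is a minimal sketch. The record layout (an account key and an amount) and the names GroupTotals, Txn and totalsByAccount are made up for illustration and do not come from any real application. It makes one sequential pass over unordered input and keeps a running total per key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupTotals {
    // Hypothetical record layout: a grouping key and an amount.
    static final class Txn {
        final String account;
        final long amount;
        Txn(String account, long amount) {
            this.account = account;
            this.amount = amount;
        }
    }

    // One pass over the input in whatever order it arrives.
    // Memory use is one map entry per distinct key, not per record.
    static Map<String, Long> totalsByAccount(Iterable<Txn> input) {
        Map<String, Long> totals = new HashMap<>();
        for (Txn t : input) {
            // Insert the amount for a new key, otherwise add it to the running total.
            totals.merge(t.account, t.amount, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Txn> unsorted = Arrays.asList(
            new Txn("B", 10), new Txn("A", 5), new Txn("B", 7), new Txn("A", 1));
        // HashMap iteration order is unspecified, so the groups may print in any order.
        System.out.println(totalsByAccount(unsorted));
    }
}
```

If the totals have to come out in key order, only the distinct keys need sorting at the end (or a TreeMap can be used instead), which is normally far less work than sorting every input record. The catch is that the distinct keys have to fit in memory.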

Second point I'd like to make also is related to overlap.  Sorting the
files allows downstream process to read them sequentially rather than
random gets from say VSAM or a data base.  When you read or write
sequentially, you have opportunities for I/O overlap along with blocking
and chaining.  So you can be reading the next set of data while your
program is processing the previous set of data.  This results in
considerable elapsed time savings and reduction in I/O overhead since more
data is transferred with each I/O.

This is more what I had in mind - other reasons for sorting data before processing. I can see that VSAM would benefit from reading in order. I'm not so sure that a database like DB2 stores data in order - DB2 might be fastest if you don't specify a sort order and just take it as it comes from the database. There's also the question of whether you save enough CPU and I/O to make up for the cost of the sort.

A Hashmap potentially allows you to read sequentially and match records between files, without caring about the order.
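As a rough sketch of what I mean by matching records between files: load the smaller file into a map keyed on the match field, then read the larger file sequentially and look each record up. The class name HashMatch, the {key, payload} record layout and the sample data are assumptions for illustration only, and the sketch assumes the smaller file fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashMatch {
    // Each record is a hypothetical {key, payload} pair; neither input is sorted.
    static List<String> match(Iterable<String[]> small, Iterable<String[]> large) {
        // Build phase: one sequential pass over the smaller input.
        // A list per key allows for duplicate keys.
        Map<String, List<String>> byKey = new HashMap<>();
        for (String[] r : small) {
            byKey.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        // Probe phase: one sequential pass over the larger input.
        List<String> out = new ArrayList<>();
        for (String[] r : large) {
            List<String> hits = byKey.get(r[0]);
            if (hits == null) {
                continue; // no matching record in the smaller input
            }
            for (String payload : hits) {
                out.add(r[0] + ": " + payload + " / " + r[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> customers = Arrays.asList(
            new String[] {"C2", "Jones"}, new String[] {"C1", "Smith"});
        List<String[]> orders = Arrays.asList(
            new String[] {"C1", "order 17"}, new String[] {"C3", "order 18"},
            new String[] {"C2", "order 19"});
        match(customers, orders).forEach(System.out::println);
    }
}
```

Both files are read once, front to back, so the sequential I/O benefits mentioned above still apply. The output comes out in the order of the larger file rather than in key order.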

This doesn't really relate to the work I am doing. It was just speculation about whether Java etc. on z/OS provided opportunity to reduce CPU by implementing better algorithms, prompted by the comment about the amount of batch DFSORT people run.




--
Andrew Rowley
Black Hill Software
+61 413 302 386

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
