On 4/04/2016 11:25, David Betten wrote:
First the idea of loading all the data into a large hashmap to do the sort
tends to eliminate one very important thing and that's overlap.
Essentially, you read the entire input, conduct your massive hashsort, and
then write the output with no overlap of those three phases.  The approach
I prefer is an iterative process of sorting smaller amounts and writing
them to work files (either on disk or in memory) and then at end of input,
you almost immediately begin the output process of merging those sorted
strings.  This technique is very efficient and I can tell you many z/OS
customers are sorting tens to hundreds of gigabytes of data this way.
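
For anyone who wants to see the shape of that run-and-merge approach in Java, here is a minimal sketch. It is not how DFSORT is implemented, and it is single threaded, so it shows only the structure (sorted runs written to work files, then a merge that starts at end of input) and none of the real I/O overlap. The class name RunMergeSort, the linesPerRun parameter and the line-per-record text format are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class RunMergeSort {
    // One open work file plus the record currently at its head.
    static final class Run {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    // Sort one batch in memory and write it out as a sorted work file.
    static Path writeRun(List<String> batch) throws IOException {
        Collections.sort(batch);
        Path run = Files.createTempFile("sortwk", ".txt");
        Files.write(run, batch);
        return run;
    }

    static void sort(Path input, Path output, int linesPerRun) throws IOException {
        List<Path> runs = new ArrayList<>();

        // Phase 1: read the input, emitting a sorted run each time the batch fills.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> batch = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                if (batch.size() == linesPerRun) {
                    runs.add(writeRun(batch));
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                runs.add(writeRun(batch));
            }
        }

        // Phase 2: merge the runs; the heap always yields the smallest head record.
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path p : runs) {
                Run r = new Run(Files.newBufferedReader(p));
                if (r.head != null) heap.add(r); else r.reader.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                out.write(r.head);
                out.newLine();
                r.head = r.reader.readLine();
                if (r.head != null) heap.add(r); else r.reader.close();
            }
        } finally {
            // Clean up any readers left open by a failure, then the work files.
            for (Run r : heap) r.reader.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }
}
```

Only linesPerRun records plus one record per run are held in memory at any time, which is what lets this style of sort handle inputs far larger than memory.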

I wasn't actually suggesting sorting with a HashMap, or that a Java sort is more efficient than DFSORT (although the overhead of transferring data between Java and DFSORT might make a Java sort preferable when the data is already in Java).

I was wondering instead whether collection classes like HashMap could avoid the need to sort the data at all, at which point the efficiency of the sort becomes moot. One common reason given for sorting data is to do grouping and totals, which can easily be implemented using a HashMap with unordered data.
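
As a rough illustration of grouping and totals without a sort, something like the following would do. The Sale class and its region/amount fields are made up for the example, and it assumes the number of distinct keys is small enough for the map to fit in memory.

```java
import java.util.HashMap;
import java.util.Map;

public class GroupTotals {
    // Hypothetical record layout: a grouping key and an amount to total.
    static final class Sale {
        final String region;
        final long amount;

        Sale(String region, long amount) {
            this.region = region;
            this.amount = amount;
        }
    }

    // Single pass over the input in whatever order it arrives;
    // the map keeps one running total per key.
    static Map<String, Long> totalsByKey(Iterable<Sale> sales) {
        Map<String, Long> totals = new HashMap<>();
        for (Sale s : sales) {
            totals.merge(s.region, s.amount, Long::sum);
        }
        return totals;
    }
}
```

Feeding it WEST 10, EAST 5, WEST 7 in that order gives a total of 17 for WEST and 5 for EAST. The totals come out in no particular order, so if the report itself has to be in key order you would still sort the (usually much smaller) set of totals, or use a TreeMap instead.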

The second point I'd like to make is also related to overlap.  Sorting the
files allows downstream processes to read them sequentially rather than
doing random gets from, say, VSAM or a database.  When you read or write
sequentially, you have opportunities for I/O overlap along with blocking
and chaining.  So you can be reading the next set of data while your
program is processing the previous set of data.  This results in
considerable elapsed time savings and reduction in I/O overhead since more
data is transferred with each I/O.

This is more what I had in mind - other reasons for sorting data before processing. I can see that VSAM would benefit from reading in order. I'm not so sure that a database like DB2 stores data in order - DB2 might be fastest if you don't specify a sort order and just take it as it comes from the database. There's also the question of whether you save enough CPU and I/O to make up for the cost of the sort.

A HashMap potentially allows you to read sequentially and match records between files, without caring about the order.
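
A sketch of what I mean by matching, under some loudly stated assumptions: each record is a text line of the form "key,rest", keys in the first ("master") input are unique, and the master input fits in memory. The HashMatch class and the keyOf helper are hypothetical names for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashMatch {
    // Assumed record layout: the key is everything before the first comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Load the master records into a map, then stream the transactions
    // sequentially and look each one up by key. Neither input is sorted.
    static List<String> match(Iterable<String> master, Iterable<String> transactions) {
        Map<String, String> byKey = new HashMap<>();
        for (String m : master) {
            byKey.put(keyOf(m), m);
        }
        List<String> matched = new ArrayList<>();
        for (String t : transactions) {
            String m = byKey.get(keyOf(t));
            if (m != null) {
                matched.add(m + " | " + t);
            }
        }
        return matched;
    }
}
```

Both files are read once, front to back. The catch is the memory for the map: if neither file fits, you are back to sorting both and doing a conventional match-merge.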

This doesn't really relate to the work I am doing. It was just speculation about whether Java etc. on z/OS provided opportunity to reduce CPU by implementing better algorithms, prompted by the comment about the amount of batch DFSORT people run.



--
Andrew Rowley
Black Hill Software
+61 413 302 386
