the classic which is the fastest way to sort question

2009-11-09 Thread Michael Alipio
Hi,

i'm planning to sort an input file (which was File::Slurp'ed, most likely 
megabyte-sized file) in various ways. I did some readings and learned several 
methods
that people have come up with in recent years. So to summarize, the default 
sort is fast (uses quick sort), explicit (using sub) is a bit slower, other 
method
that uses caching is faster. Then there's Schwartzian Transform and a packed 
version by Guttman. Seems like everything is clear. Guttman is the fastest,
until I went to cpan. Found Sort::Key, which claims to be the fastest, even 
faster that ST, GRT. Now, before someone says, why not try each one and
see for yourself (which doing such could be another subject for me to learn), 
my question is this: if such faster sorting algorithms exist, why don't they 
just replace the default sort function in Perl?

And for the classical question, given my situation (in combination with 
File::Slurp), which is fastest sort method? (I hope somebody includes this in 
perlfaq in the future).


  

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Uri Guttman
 MA == Michael Alipio daem0n...@yahoo.com writes:

  MA i'm planning to sort an input file (which was File::Slurp'ed, most
  MA likely megabyte-sized file) in various ways. I did some readings
  MA and learned several methods that people have come up with in
  MA recent years. So to summarize, the default sort is fast (uses
  MA quick sort), explicit (using sub) is a bit slower, other method
  MA that uses caching is faster. Then there's Schwartzian Transform
  MA and a packed version by Guttman. Seems like everything is
  MA clear. Guttman is the fastest, until I went to cpan. Found
  MA Sort::Key, which claims to be the fastest, even faster that ST,
  MA GRT. Now, before someone says, why not try each one and see for
  MA yourself (which doing such could be another subject for me to
  MA learn), my question is this: if such faster sorting algorithms
  MA exist, why don't they just replace the default sort function in
  MA Perl?

you need to understand how those techniques work and some sort theory as
well. perl's internal sort is a generic sort on strings (using the
quicksort algorithm). you aren't going to get faster than that for
general data (if you have specific known types of data you can sometimes
optimize it een more). what all the techniques do is optimize key
extraction. instead of extracting the keys for each comparison (this is
O(N * log N) in sort theory notation) you do it once for each key which
is O(N). so you transfer a large amount of work from N * log N to order
N. and the more data you have, the more time you save. packing the keys
into strings is another optimization as you don't need the callback for
each comparision (sort will use the internal string comparison). so that
is why the GRT which does a packed sort is faster than the ST which
needs callbacks and array accessing during each comparison.

as for Sort::Key vs Sort::Maker's speed, i haven't run a benchmark on
them. this isn't hard to do and would be good for you as an
exercise. they have very different api's and designs so that is an issue
too. also sort::maker can generate all 4 styles of sorts and you can get
the generated code for reuse later. that could save some more time as
well.

  MA And for the classical question, given my situation (in combination
  MA with File::Slurp), which is fastest sort method? (I hope somebody
  MA includes this in perlfaq in the future).

this isn't an FAQ so i doubt it will ever be added. sorting large data
sets in perl isn't that common. most large sorts are done in the DB
already or by external very fast sort programs. your megabyte size files
aren't even that large by today's standards (especially for slurping).

uri

-- 
Uri Guttman  --  u...@stemsystems.com    http://www.sysarch.com --
-  Perl Code Review , Architecture, Development, Training, Support --
-  Gourmet Hot Cocoa Mix    http://bestfriendscocoa.com -

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Philip Potter
2009/11/9 Michael Alipio daem0n...@yahoo.com:
 Hi,

 i'm planning to sort an input file (which was File::Slurp'ed, most likely 
 megabyte-sized file) in various ways. I did some readings and learned several 
 methods
 that people have come up with in recent years. So to summarize, the default 
 sort is fast (uses quick sort),

If default sort is quicksort, I can't find the docs that say so. They
seem to imply (but don't state explicitly) that default sort is
whatever was measured to be fastest on that platform, be it quicksort
or mergesort. See perldoc -f sort (function) and perldoc sort
(pragma).

explicit (using sub) is a bit slower,

What does this mean? You can write anything, including any sorting
algorithm, in a sub.

 other method
 that uses caching is faster. Then there's Schwartzian Transform and a packed 
 version by Guttman.

ST is not a sorting algorithm. It's a technique to improve the
efficiency of another sorting algorithm by reducing the cost of the
comparison function.

 Seems like everything is clear. Guttman is the fastest,
 until I went to cpan. Found Sort::Key, which claims to be the fastest, even 
 faster that ST, GRT. Now, before someone says, why not try each one and
 see for yourself (which doing such could be another subject for me to learn), 
 my question is this: if such faster sorting algorithms exist, why don't they 
 just replace the default sort function in Perl?

The default sort function is based on the classical sorting paradigm
of having a comparison function and trying to do as few comparisons as
possible. Any replacement for it must be a drop-in replacement in
order to not break existing code. In fact as perldoc -f sort shows,
there have been changes to the default sort over the years: perl 5.6
sort is quicksort, perl 5.7 sort is mergesort, and perl 5.8 and later
has a sort which doesn't seem specified but which can be controlled by
the 'use sort' pragma. The perl team are perfectly capable of
improving efficiency and functionality if they see a need to.

ST works on comparison-based sorting algorithms to make the comparison
function cheaper by precaching as much of the work as possible.
Whether this is more efficient or not depends on how expensive the
comparison is and how much work can be extracted into an ST. If you
need an ST, it's not because perl's sort() is inefficient; it is
because the comparison function you gave it is inefficient. There is
no improvement that the perl authors can ever make to sort() which
will make an ST unnecessary.

Sort::Key has a different interface: instead of a comparison function,
you give it a key-generating function. As a result, it is able to do
the precaching work itself, so no ST would be necessary. This means,
however, that it is not a drop-in replacement for sort(); to use
Sort::Key you have to rewrite code. Perl will always need a sort()
function to maintain backward compatibility and because every sort can
be expressed in terms of comparison functions.

As for whether or why Sort::Key is faster, I can't say because its
perldoc doesn't seem to give away its algorithm. One possible
algorithm it uses is radix sort, which is O(n) provided the calculated
keys are of bounded size -- which works for natural integers, but
fails for bignums.

 And for the classical question, given my situation (in combination with 
 File::Slurp), which is fastest sort method? (I hope somebody includes this in 
 perlfaq in the future).

I'm sorry to say, but this really will depend on your system. I don't
know how Sort::Key works, but regarding all the other sorts, you'd
have to measure it depending on your data and your system.

Do you need the fastest possible sort?

Philip

PS your email client has a very long line length, causing my quoting
above to go somewhat haywire. I'd recommend setting it to something
like 74.

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Michael Alipio
Hi,

 

  Do you need the fastest possible sort? 

I'm not even sure if I really need to worry about all these 
sorting techniques. My program just reads a text file 
(wordlist). It might be megabyte-sized or probably few 
gigabytes (i might also add size checking on this to be
safe with File::Slurp). Then I will give the user an option
of sorting it in various ways, like length, alphabetical,
numerical, frequency of letters, etc. 

I see, so it all boils down to how expensive the 
comparison you're going to implement to fully benefit 
from these techniques. Now comes another question 
for me to find the answer to, how expensive the 
comparisons in my sorting function would be... I 
guess there's no other way for me to find this out 
than to try it out  myself. What's worse is that 
there's also a depends on the system factor to 
consider as well. Sometimes I wish perl's motto 
is there's only one best way to do it so everyone
would just agree on one way of doing something, 
so everyone would have the same beautiful
and efficient code. For now, I will probably just stick
to using the built-in sort (just for sorting length, 
numbers, and letters), until I have gained enough 
knowledge about why it's necessary to use the 
other techniques, or how to do the benchmark 
myself.


 Philip

 PS your email client has a very long line length, causing my quoting
  above to go somewhat haywire. I'd recommend setting it to something
 like 74.





--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Shlomi Fish
On Monday 09 Nov 2009 19:40:29 Uri Guttman wrote:
  MA == Michael Alipio daem0n...@yahoo.com writes:
 
   MA i'm planning to sort an input file (which was File::Slurp'ed, most
   MA likely megabyte-sized file) in various ways. I did some readings
   MA and learned several methods that people have come up with in
   MA recent years. So to summarize, the default sort is fast (uses
   MA quick sort), explicit (using sub) is a bit slower, other method
   MA that uses caching is faster. Then there's Schwartzian Transform
   MA and a packed version by Guttman. Seems like everything is
   MA clear. Guttman is the fastest, until I went to cpan. Found
   MA Sort::Key, which claims to be the fastest, even faster that ST,
   MA GRT. Now, before someone says, why not try each one and see for
   MA yourself (which doing such could be another subject for me to
   MA learn), my question is this: if such faster sorting algorithms
   MA exist, why don't they just replace the default sort function in
   MA Perl?
 
 you need to understand how those techniques work and some sort theory as
 well. perl's internal sort is a generic sort on strings (using the
 quicksort algorithm). 

Just a small correction. Starting from perl-5.8.x, Perl's sort operator is 
using the Merge sort algorithm, which was shown to perform somewhat better 
than Quicksort in the benchmarks that were conducted:

http://en.wikipedia.org/wiki/Merge_sort#Comparison_with_other_sort_algorithms

Here's a little theory:

1. Merge sort has a *worst-case* complexity of O(N*log(N)), but it was 
believed to be less performant than Quicksort.

2. Quick Sort has a worst-case complexity of O(N**2) and a provably average 
case complexity of O(N*log(N)) , and was usually believed to be more 
performant than Merge sort. The standard C library (libc, the Windows C RTL, 
etc.) contains a qsort() function that implements Quicksort, but suffers from 
several philosophical limitations. perl 5 (the Perl 5 implementation) used to 
use qsort() up to version 5.005 or so, when it was replaced with a custom 
function.

Some of the traditional Quick Sort's worst (O(N**2)) performance is when it is 
used to sort in-order or close-to-in-order data, though some of the variations 
of it can be avoided.

3. Merge sort can also be efficiently used for sorting linked-lists as opposed 
to random-access arrays.

4. Generally, with small data sets, it is a better idea to use a naïve O(N**2) 
sorting algorithm such as Insertion Sort (which is generally better than 
Bubble Sort and saner).

5. Given some known assumptions on the data, one can sometimes sort at a 
better asymptotic complexity than O(N*log(N)):

* http://en.wikipedia.org/wiki/Counting_sort

* http://en.wikipedia.org/wiki/Radix_sort

--

Oh and general sort on strings should be general sort on scalars, though I 
suppose that's what you meant. Using perl's || operator one can conviently 
sort based on several aspects. E.g:


my @sorted_records =
(sort 
{ 
($a-{'last_name'} cmp $b-{'last_names})
||
($a-{'first_name'} cmp $b-{'first_name'})
||
($a-{'age'} = $b-{'age'})
}
@records)


(Untested.)

Too bad the || operator is magical in Perl and cannot be easily extended in 
user-land, though you can do something similar with or(sub { ... }, sub { ... 
}, sub { ... }, sub { ... }).

Regards,

Shlomi Fish

-- 
-
Shlomi Fish   http://www.shlomifish.org/
Original Riddles - http://www.shlomifish.org/puzzles/

Chuck Norris read the entire English Wikipedia in 24 hours. Twice.

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Philip Potter
2009/11/9 Michael Alipio daem0n...@yahoo.com:
 Hi,

  Do you need the fastest possible sort?

 I'm not even sure if I really need to worry about all these
 sorting techniques. My program just reads a text file
 (wordlist). It might be megabyte-sized or probably few
 gigabytes (i might also add size checking on this to be
 safe with File::Slurp). Then I will give the user an option
 of sorting it in various ways, like length, alphabetical,
 numerical, frequency of letters, etc.

 I see, so it all boils down to how expensive the
 comparison you're going to implement to fully benefit
 from these techniques.

Having a cheap comparison is important, but it's just one part of a
very large story. Donald Knuth, one of the masters of computer
science, wrote an entire textbook on searching and sorting:
http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850

This is a huge subject and we've barely scratched the surface. There
is no it all boils down to.

 Now comes another question
 for me to find the answer to, how expensive the
 comparisons in my sorting function would be... I
 guess there's no other way for me to find this out
 than to try it out  myself.

Actually, there is a certain amount of reasoning possible with respect
to comparison functions: for example, a Schwartzian Transform will be
a win if the key calculation is more expensive than the comparison of
keys.

 What's worse is that
 there's also a depends on the system factor to
 consider as well. Sometimes I wish perl's motto
 is there's only one best way to do it so everyone
 would just agree on one way of doing something,
 so everyone would have the same beautiful
 and efficient code.

There is no programming language which will automatically optimise the
best possible sort for you. Not Perl, not Python, not Java, not C++.
If you want the best, you will have to learn the theory. A compsci
course or a thorough read of Knuth will go far.

 For now, I will probably just stick
 to using the built-in sort (just for sorting length,
 numbers, and letters), until I have gained enough
 knowledge about why it's necessary to use the
 other techniques, or how to do the benchmark
 myself.

Oh, it's not too hard to do benchmarking, use cmpthese from Benchmark:
http://search.cpan.org/~dapm/perl-5.10.1/lib/Benchmark.pm

Phil

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: the classic which is the fastest way to sort question

2009-11-09 Thread Uri Guttman
 PP == Philip Potter philip.g.pot...@gmail.com writes:

  PP Actually, there is a certain amount of reasoning possible with respect
  PP to comparison functions: for example, a Schwartzian Transform will be
  PP a win if the key calculation is more expensive than the comparison of
  PP keys.

that is incorrect. it all depends on the growth of the algorithm. key
extraction can be the same work as comparisons but it becomes O(N). note
that with a straight call to sort() the comparison must also do key
extraction. if the keys are just a list of strings or numbers, then the
extraction becomes nothing and there is no win for any transform. pretty
much anything else will win given a large enough data set. that is
always the case. a slow as hell bubble sort (O(N**2)) is FASTER for 3
elements than any other sort only because the data set is so small. sort
speed is always about growth curves which is what O() notation is all
about. for instance O( N * log N + N ) is the SAME as O( N * log N
). the highest order term wins since it grows the fastest. and constant
factors are removed from the function as well. the win with the ST and
GRT is not in the growth curve but in the real world savings of
duplicated work (key extraction). it is done only once per key and not
for each comparison. note that you must always do some key extraction
(even if it is a no-op) so converting that from O( N * log N ) to O( N )
is a real world win especially with larger data sets. but the actual O()
value doesn't change! the sort comparison is still O( N * log N ) with
any of the transforms since sort is called.

uri

-- 
Uri Guttman  --  u...@stemsystems.com    http://www.sysarch.com --
-  Perl Code Review , Architecture, Development, Training, Support --
-  Gourmet Hot Cocoa Mix    http://bestfriendscocoa.com -

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/