the classic which is the fastest way to sort question
Hi, i'm planning to sort an input file (which was File::Slurp'ed, most likely megabyte-sized file) in various ways. I did some readings and learned several methods that people have come up with in recent years. So to summarize, the default sort is fast (uses quick sort), explicit (using sub) is a bit slower, other method that uses caching is faster. Then there's Schwartzian Transform and a packed version by Guttman. Seems like everything is clear. Guttman is the fastest, until I went to cpan. Found Sort::Key, which claims to be the fastest, even faster that ST, GRT. Now, before someone says, why not try each one and see for yourself (which doing such could be another subject for me to learn), my question is this: if such faster sorting algorithms exist, why don't they just replace the default sort function in Perl? And for the classical question, given my situation (in combination with File::Slurp), which is fastest sort method? (I hope somebody includes this in perlfaq in the future). -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
MA == Michael Alipio daem0n...@yahoo.com writes: MA i'm planning to sort an input file (which was File::Slurp'ed, most MA likely megabyte-sized file) in various ways. I did some readings MA and learned several methods that people have come up with in MA recent years. So to summarize, the default sort is fast (uses MA quick sort), explicit (using sub) is a bit slower, other method MA that uses caching is faster. Then there's Schwartzian Transform MA and a packed version by Guttman. Seems like everything is MA clear. Guttman is the fastest, until I went to cpan. Found MA Sort::Key, which claims to be the fastest, even faster that ST, MA GRT. Now, before someone says, why not try each one and see for MA yourself (which doing such could be another subject for me to MA learn), my question is this: if such faster sorting algorithms MA exist, why don't they just replace the default sort function in MA Perl? you need to understand how those techniques work and some sort theory as well. perl's internal sort is a generic sort on strings (using the quicksort algorithm). you aren't going to get faster than that for general data (if you have specific known types of data you can sometimes optimize it een more). what all the techniques do is optimize key extraction. instead of extracting the keys for each comparison (this is O(N * log N) in sort theory notation) you do it once for each key which is O(N). so you transfer a large amount of work from N * log N to order N. and the more data you have, the more time you save. packing the keys into strings is another optimization as you don't need the callback for each comparision (sort will use the internal string comparison). so that is why the GRT which does a packed sort is faster than the ST which needs callbacks and array accessing during each comparison. as for Sort::Key vs Sort::Maker's speed, i haven't run a benchmark on them. this isn't hard to do and would be good for you as an exercise. they have very different api's and designs so that is an issue too. also sort::maker can generate all 4 styles of sorts and you can get the generated code for reuse later. that could save some more time as well. MA And for the classical question, given my situation (in combination MA with File::Slurp), which is fastest sort method? (I hope somebody MA includes this in perlfaq in the future). this isn't an FAQ so i doubt it will ever be added. sorting large data sets in perl isn't that common. most large sorts are done in the DB already or by external very fast sort programs. your megabyte size files aren't even that large by today's standards (especially for slurping). uri -- Uri Guttman -- u...@stemsystems.com http://www.sysarch.com -- - Perl Code Review , Architecture, Development, Training, Support -- - Gourmet Hot Cocoa Mix http://bestfriendscocoa.com - -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
2009/11/9 Michael Alipio daem0n...@yahoo.com: Hi, i'm planning to sort an input file (which was File::Slurp'ed, most likely megabyte-sized file) in various ways. I did some readings and learned several methods that people have come up with in recent years. So to summarize, the default sort is fast (uses quick sort), If default sort is quicksort, I can't find the docs that say so. They seem to imply (but don't state explicitly) that default sort is whatever was measured to be fastest on that platform, be it quicksort or mergesort. See perldoc -f sort (function) and perldoc sort (pragma). explicit (using sub) is a bit slower, What does this mean? You can write anything, including any sorting algorithm, in a sub. other method that uses caching is faster. Then there's Schwartzian Transform and a packed version by Guttman. ST is not a sorting algorithm. It's a technique to improve the efficiency of another sorting algorithm by reducing the cost of the comparison function. Seems like everything is clear. Guttman is the fastest, until I went to cpan. Found Sort::Key, which claims to be the fastest, even faster that ST, GRT. Now, before someone says, why not try each one and see for yourself (which doing such could be another subject for me to learn), my question is this: if such faster sorting algorithms exist, why don't they just replace the default sort function in Perl? The default sort function is based on the classical sorting paradigm of having a comparison function and trying to do as few comparisons as possible. Any replacement for it must be a drop-in replacement in order to not break existing code. In fact as perldoc -f sort shows, there have been changes to the default sort over the years: perl 5.6 sort is quicksort, perl 5.7 sort is mergesort, and perl 5.8 and later has a sort which doesn't seem specified but which can be controlled by the 'use sort' pragma. The perl team are perfectly capable of improving efficiency and functionality if they see a need to. ST works on comparison-based sorting algorithms to make the comparison function cheaper by precaching as much of the work as possible. Whether this is more efficient or not depends on how expensive the comparison is and how much work can be extracted into an ST. If you need an ST, it's not because perl's sort() is inefficient; it is because the comparison function you gave it is inefficient. There is no improvement that the perl authors can ever make to sort() which will make an ST unnecessary. Sort::Key has a different interface: instead of a comparison function, you give it a key-generating function. As a result, it is able to do the precaching work itself, so no ST would be necessary. This means, however, that it is not a drop-in replacement for sort(); to use Sort::Key you have to rewrite code. Perl will always need a sort() function to maintain backward compatibility and because every sort can be expressed in terms of comparison functions. As for whether or why Sort::Key is faster, I can't say because its perldoc doesn't seem to give away its algorithm. One possible algorithm it uses is radix sort, which is O(n) provided the calculated keys are of bounded size -- which works for natural integers, but fails for bignums. And for the classical question, given my situation (in combination with File::Slurp), which is fastest sort method? (I hope somebody includes this in perlfaq in the future). I'm sorry to say, but this really will depend on your system. I don't know how Sort::Key works, but regarding all the other sorts, you'd have to measure it depending on your data and your system. Do you need the fastest possible sort? Philip PS your email client has a very long line length, causing my quoting above to go somewhat haywire. I'd recommend setting it to something like 74. -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
Hi, Do you need the fastest possible sort? I'm not even sure if I really need to worry about all these sorting techniques. My program just reads a text file (wordlist). It might be megabyte-sized or probably few gigabytes (i might also add size checking on this to be safe with File::Slurp). Then I will give the user an option of sorting it in various ways, like length, alphabetical, numerical, frequency of letters, etc. I see, so it all boils down to how expensive the comparison you're going to implement to fully benefit from these techniques. Now comes another question for me to find the answer to, how expensive the comparisons in my sorting function would be... I guess there's no other way for me to find this out than to try it out myself. What's worse is that there's also a depends on the system factor to consider as well. Sometimes I wish perl's motto is there's only one best way to do it so everyone would just agree on one way of doing something, so everyone would have the same beautiful and efficient code. For now, I will probably just stick to using the built-in sort (just for sorting length, numbers, and letters), until I have gained enough knowledge about why it's necessary to use the other techniques, or how to do the benchmark myself. Philip PS your email client has a very long line length, causing my quoting above to go somewhat haywire. I'd recommend setting it to something like 74. -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
On Monday 09 Nov 2009 19:40:29 Uri Guttman wrote: MA == Michael Alipio daem0n...@yahoo.com writes: MA i'm planning to sort an input file (which was File::Slurp'ed, most MA likely megabyte-sized file) in various ways. I did some readings MA and learned several methods that people have come up with in MA recent years. So to summarize, the default sort is fast (uses MA quick sort), explicit (using sub) is a bit slower, other method MA that uses caching is faster. Then there's Schwartzian Transform MA and a packed version by Guttman. Seems like everything is MA clear. Guttman is the fastest, until I went to cpan. Found MA Sort::Key, which claims to be the fastest, even faster that ST, MA GRT. Now, before someone says, why not try each one and see for MA yourself (which doing such could be another subject for me to MA learn), my question is this: if such faster sorting algorithms MA exist, why don't they just replace the default sort function in MA Perl? you need to understand how those techniques work and some sort theory as well. perl's internal sort is a generic sort on strings (using the quicksort algorithm). Just a small correction. Starting from perl-5.8.x, Perl's sort operator is using the Merge sort algorithm, which was shown to perform somewhat better than Quicksort in the benchmarks that were conducted: http://en.wikipedia.org/wiki/Merge_sort#Comparison_with_other_sort_algorithms Here's a little theory: 1. Merge sort has a *worst-case* complexity of O(N*log(N)), but it was believed to be less performant than Quicksort. 2. Quick Sort has a worst-case complexity of O(N**2) and a provably average case complexity of O(N*log(N)) , and was usually believed to be more performant than Merge sort. The standard C library (libc, the Windows C RTL, etc.) contains a qsort() function that implements Quicksort, but suffers from several philosophical limitations. perl 5 (the Perl 5 implementation) used to use qsort() up to version 5.005 or so, when it was replaced with a custom function. Some of the traditional Quick Sort's worst (O(N**2)) performance is when it is used to sort in-order or close-to-in-order data, though some of the variations of it can be avoided. 3. Merge sort can also be efficiently used for sorting linked-lists as opposed to random-access arrays. 4. Generally, with small data sets, it is a better idea to use a naïve O(N**2) sorting algorithm such as Insertion Sort (which is generally better than Bubble Sort and saner). 5. Given some known assumptions on the data, one can sometimes sort at a better asymptotic complexity than O(N*log(N)): * http://en.wikipedia.org/wiki/Counting_sort * http://en.wikipedia.org/wiki/Radix_sort -- Oh and general sort on strings should be general sort on scalars, though I suppose that's what you meant. Using perl's || operator one can conviently sort based on several aspects. E.g: my @sorted_records = (sort { ($a-{'last_name'} cmp $b-{'last_names}) || ($a-{'first_name'} cmp $b-{'first_name'}) || ($a-{'age'} = $b-{'age'}) } @records) (Untested.) Too bad the || operator is magical in Perl and cannot be easily extended in user-land, though you can do something similar with or(sub { ... }, sub { ... }, sub { ... }, sub { ... }). Regards, Shlomi Fish -- - Shlomi Fish http://www.shlomifish.org/ Original Riddles - http://www.shlomifish.org/puzzles/ Chuck Norris read the entire English Wikipedia in 24 hours. Twice. -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
2009/11/9 Michael Alipio daem0n...@yahoo.com: Hi, Do you need the fastest possible sort? I'm not even sure if I really need to worry about all these sorting techniques. My program just reads a text file (wordlist). It might be megabyte-sized or probably few gigabytes (i might also add size checking on this to be safe with File::Slurp). Then I will give the user an option of sorting it in various ways, like length, alphabetical, numerical, frequency of letters, etc. I see, so it all boils down to how expensive the comparison you're going to implement to fully benefit from these techniques. Having a cheap comparison is important, but it's just one part of a very large story. Donald Knuth, one of the masters of computer science, wrote an entire textbook on searching and sorting: http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850 This is a huge subject and we've barely scratched the surface. There is no it all boils down to. Now comes another question for me to find the answer to, how expensive the comparisons in my sorting function would be... I guess there's no other way for me to find this out than to try it out myself. Actually, there is a certain amount of reasoning possible with respect to comparison functions: for example, a Schwartzian Transform will be a win if the key calculation is more expensive than the comparison of keys. What's worse is that there's also a depends on the system factor to consider as well. Sometimes I wish perl's motto is there's only one best way to do it so everyone would just agree on one way of doing something, so everyone would have the same beautiful and efficient code. There is no programming language which will automatically optimise the best possible sort for you. Not Perl, not Python, not Java, not C++. If you want the best, you will have to learn the theory. A compsci course or a thorough read of Knuth will go far. For now, I will probably just stick to using the built-in sort (just for sorting length, numbers, and letters), until I have gained enough knowledge about why it's necessary to use the other techniques, or how to do the benchmark myself. Oh, it's not too hard to do benchmarking, use cmpthese from Benchmark: http://search.cpan.org/~dapm/perl-5.10.1/lib/Benchmark.pm Phil -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: the classic which is the fastest way to sort question
PP == Philip Potter philip.g.pot...@gmail.com writes: PP Actually, there is a certain amount of reasoning possible with respect PP to comparison functions: for example, a Schwartzian Transform will be PP a win if the key calculation is more expensive than the comparison of PP keys. that is incorrect. it all depends on the growth of the algorithm. key extraction can be the same work as comparisons but it becomes O(N). note that with a straight call to sort() the comparison must also do key extraction. if the keys are just a list of strings or numbers, then the extraction becomes nothing and there is no win for any transform. pretty much anything else will win given a large enough data set. that is always the case. a slow as hell bubble sort (O(N**2)) is FASTER for 3 elements than any other sort only because the data set is so small. sort speed is always about growth curves which is what O() notation is all about. for instance O( N * log N + N ) is the SAME as O( N * log N ). the highest order term wins since it grows the fastest. and constant factors are removed from the function as well. the win with the ST and GRT is not in the growth curve but in the real world savings of duplicated work (key extraction). it is done only once per key and not for each comparison. note that you must always do some key extraction (even if it is a no-op) so converting that from O( N * log N ) to O( N ) is a real world win especially with larger data sets. but the actual O() value doesn't change! the sort comparison is still O( N * log N ) with any of the transforms since sort is called. uri -- Uri Guttman -- u...@stemsystems.com http://www.sysarch.com -- - Perl Code Review , Architecture, Development, Training, Support -- - Gourmet Hot Cocoa Mix http://bestfriendscocoa.com - -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/