On Thu, Nov 11, 2021 at 11:22 AM Antoine Pitrou <anto...@python.org> wrote:
> On Thu, 11 Nov 2021 11:01:32 -0800 > Richard Levasseur <richard...@gmail.com> wrote: > > > > In the end, it was a fun exercise, but in practice a dictionary and > > sorted() got me 90% of the way there and sufficed. Optimizing that last > 10% > > wasn't worth the effort. > > > > Anyways, I came to two particular, IMHO, conclusions: > > 1. Sorted collections are very niche. I *could* have used one, but > probably > > would have spent as much time fiddling with them as using a > dict/sorted(). > > So the fact that sorted collections did not help for one single use > case leads you to conclude that "sorted collections are very niche"? > I don't find this very convincing. > It was just an example demonstrating that, despite all the apparent "things need to be kept sorted"-ness of the problem, the strict requirement that data be fully sorted at every step of the code wasn't actually present. That sorting at the end is quite sufficient. It just had the mere appearance of being a requirement. This example is not unlike the exchange Paul Bryan and Christopher Baker had upthread where Paul gave an example, and Christopher pointed out sorting at the end was sufficient. And no, that I didn't choose[1] to use a sorted collection for that particular case didn't make me conclude they were very niche. Reflecting on my experience coding and reviewing code over, eh, 20 years? Or w/e. lead me to realize that very rarely have I seen, needed, or recommended using a sorted collection. In my experience, the order of elements in a collection mattering is the exception, not the rule. Hence, I conclude they're niche. [1] As my example described, a sorted collection would have been able to realize some optimal performance characteristic, but eking out that last bit wasn't worth the effort. > > Unfortunately, PyPI doesn't seem to offer a way to query the reverse > dependencies of a package, otherwise we could know how many packages > depend on the "sortedcontainers" package. > > Querying my employer's code base, the number of hits for sortedcontainers isn't particularly impressive. It's not so low to be dismissed, but it's also not so high as to be anywhere near unequivacable. Forgive me for not sharing the actual numbers; I'm just not really sure on what sort of code base statistics and numbers I'm permitted to share under what circumstances etc. > [...] > > > 2. If you're in a niche use case, especially for performance, then > > abstractions aren't really helpful. Using e.g. a BTree and knowing the > > particulars of the implementation are going to be more helpful than > having > > an abstract SortedList and working out how to use its special APIs for > > particular actions. > > You don't have "an abstract SortedList", you have a concrete SortedList > with a well-defined implementation with stable performance > characteristics. So, again, this seems to be a red herring. > My point is that there are many ways to implement a collection that maintains its elements in sorted order that has well defined performance. One can pick one and accept the pros/cons of the particular underlying algorithm. But it seems misleading to call that *the* SortedList implementation. Hence why I say "abstract SortedList". To emphasize this point a bit, let's presume for the moment Python chose to have a data type named SortedList, and it was, effectively, the sortedcontainers implementation. What does one do with the "load factor" setting that sortedcontainers.SortedList has? This arg is, essentially, a BTree implementation detail leaking out. IIRC, it controls the fan out of the underlying BTree (because one size doesn't always fit all), which is one of the biggest determiners of the overall performance. If such a parameter is exposed, then it's no longer really a generic "sorted list", it's a "sorted list as implemented using a btree". So might as well call it e.g. BTreeList or some such instead. If such a parameter *isn't* exposed, then a one-size-fits-all value must be chosen[2] -- I'm skeptical that's really tenable because, if a sorted collection is needed, it's probably for performance reasons, and you need knobs like that to better fit the algorithm to your data and access patterns. And I guess I have to state upfront that my claim of "it's probably for performance reasons" is just what my experience leads me to believe, so it'll just have to be taken with the dosage of salt that best fits, too. [2] As an aside, that a list's growth factor isn't controllable works against things like sortedcontainers. Which isn't to say `list` should expose it, but that if, e.g. ArrayList(growth_factor) was a thing, then a more optimal implementation of e.g. BTree would be possible. > > > This basically echos Christopher Baker's sentiment of "whats the use > case" > > and "why doesn't a sorted() call at the end suffice" to some extent. > > The use case is obviously when you have to *maintain* sorting at all > times. Yes, there are such situations. Some rather high-profile > Python packages, such as dask/distributed, rely on this for rather > elaborate algorithmic work (dask/distributed is a distributed task > scheduler, you may also think of it as a replacement for Spark). > > Are you able to elaborate wrt their needs to use a sorted collection? Something that would be very compelling (to me, at least) is demonstrating the utility of sorted collections where their performance characteristics isn't the only/sole/primary/dominant motivator for using one. Performance is important, but it's also kind of fuzzy and easy to argue about without a clear answer/conclusion. I hope we can all agree that there do exist situations where a sorted collection is, at the least, helpful. That they aren't inherently useless. I think the main thing being questioned is when/if they are, in effect, *necessary throughout*. That when this is true: * The data must, at all times, without exception, be and stay sorted from its inception to its conclusion Then this is also true: * The case is not niche Does "dask/distributed" meet that criteria? If so, why? I have no experience, knowledge, or opinion about "dask/distributed". Maybe it *is* a compelling example, or maybe not. IDK. > I won't go over your performance considerations, but I'll just mention > that the "sortedcontainers" documentation has not one but *several* > sections exploring performance characteristics in several dimensions. > It is extremely rare for Python packages to provide such a detailed > insight on the matter (even the built-in `dict` is much less thoroughly > characterised). > http://www.grantjenks.com/docs/sortedcontainers/#user-guide Yes, his performance analysis really is top notch. > > > Regards > > Antoine. > > > _______________________________________________ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/EHZGBXSIXZE6ZDBGJPESDQQG4PCNOMLW/ > Code of Conduct: http://python.org/psf/codeofconduct/ >
_______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3KGCC4KK2NTXEZWVFAYJHMMKWBWWQN2Z/ Code of Conduct: http://python.org/psf/codeofconduct/