On Thu, Nov 11, 2021 at 11:22 AM Antoine Pitrou <anto...@python.org> wrote:

> On Thu, 11 Nov 2021 11:01:32 -0800
> Richard Levasseur <richard...@gmail.com> wrote:
> >
> > In the end, it was a fun exercise, but in practice a dictionary and
> > sorted() got me 90% of the way there and sufficed. Optimizing that last
> 10%
> > wasn't worth the effort.
> >
> > Anyways, I came to two particular, IMHO, conclusions:
> > 1. Sorted collections are very niche. I *could* have used one, but
> probably
> > would have spent as much time fiddling with them as using a
> dict/sorted().
>
> So the fact that sorted collections did not help for one single use
> case leads you to conclude that "sorted collections are very niche"?
> I don't find this very convincing.
>

It was just an example demonstrating that, despite all the apparent "things
need to be kept sorted"-ness of the problem, the strict requirement that
data be fully sorted at every step of the code wasn't actually present.
That sorting at the end is quite sufficient. It just had the mere
appearance of being a requirement. This example is not unlike the exchange
Paul Bryan and Christopher Baker had upthread where Paul gave an example,
and Christopher pointed out sorting at the end was sufficient.

And no, that I didn't choose[1] to use a sorted collection for that
particular case didn't make me conclude they were very niche. Reflecting on
my experience coding and reviewing code over, eh, 20 years? Or w/e. lead me
to realize that very rarely have I seen, needed, or recommended using a
sorted collection. In my experience, the order of elements in a collection
mattering is the exception, not the rule. Hence, I conclude they're niche.

[1] As my example described, a sorted collection would have been able to
realize some optimal performance characteristic, but eking out that last
bit wasn't worth the effort.


>
> Unfortunately, PyPI doesn't seem to offer a way to query the reverse
> dependencies of a package, otherwise we could know how many packages
> depend on the "sortedcontainers" package.
>
>
Querying my employer's code base, the number of hits for sortedcontainers
isn't particularly impressive. It's not so low to be dismissed, but it's
also not so high as to be anywhere near unequivacable. Forgive me for not
sharing the actual numbers; I'm just not really sure on what sort of code
base statistics and numbers I'm permitted to share under what circumstances
etc.


> [...]
>
> > 2. If you're in a niche use case, especially for performance, then
> > abstractions aren't really helpful. Using e.g. a BTree and knowing the
> > particulars of the implementation are going to be more helpful than
> having
> > an abstract SortedList and working out how to use its special APIs for
> > particular actions.
>
> You don't have "an abstract SortedList", you have a concrete SortedList
> with a well-defined implementation with stable performance
> characteristics.  So, again, this seems to be a red herring.
>

My point is that there are many ways to implement a collection that
maintains its elements in sorted order that has well defined performance.
One can pick one and accept the pros/cons of the particular underlying
algorithm. But it seems misleading to call that *the* SortedList
implementation. Hence why I say "abstract SortedList".

To emphasize this point a bit, let's presume for the moment Python chose to
have a data type named SortedList, and it was, effectively, the
sortedcontainers implementation. What does one do with the "load factor"
setting that sortedcontainers.SortedList has? This arg is, essentially, a
BTree implementation detail leaking out. IIRC, it controls the fan out of
the underlying BTree (because one size doesn't always fit all), which is
one of the biggest determiners of the overall performance. If such a
parameter is exposed, then it's no longer really a generic "sorted list",
it's a "sorted list as implemented using a btree". So might as well call it
e.g. BTreeList or some such instead.

If such a parameter *isn't* exposed, then a one-size-fits-all value must be
chosen[2] -- I'm skeptical that's really tenable because, if a sorted
collection is needed, it's probably for performance reasons, and you need
knobs like that to better fit the algorithm to your data and access
patterns.

And I guess I have to state upfront that my claim of "it's probably for
performance reasons" is just what my experience leads me to believe, so
it'll just have to be taken with the dosage of salt that best fits, too.

[2] As an aside, that a list's growth factor isn't controllable works
against things like sortedcontainers. Which isn't to say `list` should
expose it, but that if, e.g. ArrayList(growth_factor) was a thing, then a
more optimal implementation of e.g. BTree would be possible.


>
> > This basically echos Christopher Baker's sentiment of "whats the use
> case"
> > and "why doesn't a sorted() call at the end suffice" to some extent.
>
> The use case is obviously when you have to *maintain* sorting at all
> times.  Yes, there are such situations.  Some rather high-profile
> Python packages, such as dask/distributed, rely on this for rather
> elaborate algorithmic work (dask/distributed is a distributed task
> scheduler, you may also think of it as a replacement for Spark).
>
>
Are you able to elaborate wrt their needs to use a sorted collection?

Something that would be very compelling (to me, at least) is demonstrating
the utility of sorted collections where their performance characteristics
isn't the only/sole/primary/dominant motivator for using one. Performance
is important, but it's also kind of fuzzy and easy to argue about without a
clear answer/conclusion.

I hope we can all agree that there do exist situations where a sorted
collection is, at the least, helpful. That they aren't inherently useless.

I think the main thing being questioned is when/if they are, in
effect, *necessary
throughout*.
That when this is true: * The data must, at all times, without exception,
be and stay sorted from its inception to its conclusion
Then this is also true:  * The case is not niche

Does "dask/distributed" meet that criteria? If so, why? I have no
experience, knowledge, or opinion about "dask/distributed". Maybe it *is* a
compelling example, or maybe not. IDK.


> I won't go over your performance considerations, but I'll just mention
> that the "sortedcontainers" documentation has not one but *several*
> sections exploring performance characteristics in several dimensions.
> It is extremely rare for Python packages to provide such a detailed
> insight on the matter (even the built-in `dict` is much less thoroughly
> characterised).
> http://www.grantjenks.com/docs/sortedcontainers/#user-guide


Yes, his performance analysis really is top notch.


>
>
> Regards
>
> Antoine.
>
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-le...@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/EHZGBXSIXZE6ZDBGJPESDQQG4PCNOMLW/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/3KGCC4KK2NTXEZWVFAYJHMMKWBWWQN2Z/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to