[Python-Dev] Re: Python Software Foundation Board of Directors Election 2022
Heads up! I found my PSF Board voting info in my gmail spam folder today; looks like it was mailed out this morning. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UDX2JHQZHICWWWMUGD3LBH32ABHWVCZC/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Can I ask a real dumb procedural question about GitHub email?
[Skip Montanaro ] > I subscribe to the python/cpython stuff on GitHub. I find it basically > impossible to follow because of the volume. > ... > How (if at all) do people deal with this firehose of email? Am I the > only person dumb enough to have tried? My observation is that, over time, all sources of email strive to enlist the aid of every electron in the universe to transform all of creation into more copies of its output, directed to my gmail account. I gave up. These days, if Google doesn't put something in the "priority" section of my gmail inbox, chances are the only thing I'll see of it is a fraction-of-a-second glance at the subject line as I check the "select all" box, on page after page, before deleting it. That gets done every day almost as often as clicking on "report spam and unsubscribe". I fondly remember the days when email volume was merely unbearable ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FE7522WPE6Q3W55KNNFLVZXR6G3CWCNH/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Barry Scott and Steve Dower share tips for convincing Visual Studio to show assembler without recompiling the file]

Thanks, fellows! That mostly ;-) worked. Problem remaining is that breakpoints just didn't work. They showed up "visually", and in the table of set breakpoints, but code went whizzing right by them every time. I didn't investigate. It's possible, e.g., that the connection between C source and the generated PGO code was so obscure that VS just gave up - or just blew it.

Instead I wrote a Python loop to run a division of interest "forever". That way I hoped I'd be likely to land in the loop of interest by luck when I broke into the process. Which worked! So here's the body of the main loop:

7FFE451D2760  mov eax,dword ptr [rcx-4]
7FFE451D2763  lea rcx,[rcx-4]
7FFE451D2767  shl r9,1Eh
7FFE451D276B  or r9,rax
7FFE451D276E  cmp r8,0Ah
7FFE451D2772  jne long_div+25Bh (07FFE451D27CBh)
7FFE451D2774  mov rax,rdi
7FFE451D2777  mul rax,r9
7FFE451D277A  mov rax,rdx
7FFE451D277D  shr rax,3
7FFE451D2781  mov dword ptr [r10+rcx],eax
7FFE451D2785  mov eax,eax
7FFE451D2787  imul rax,r8
7FFE451D278B  sub r9,rax
7FFE451D278E  sub rbx,1
7FFE451D2792  jns long_div+1F0h (07FFE451D2760h)

And above the loop is this line, which you'll recognize as loading the same scaled reciprocal of 10 as the gcc code Mark posted earlier. The code above moves %rdi into %rax before the mul instruction:

7FFE451D2747  mov rdi,0CCCDh

Note an odd decision here: the MS code compares the divisor to 10 on _every_ iteration. There are not two, "10 or not 10?", loop bodies. Instead, if the divisor isn't 10, "jne long_div+25Bh" jumps to code not shown here, a few instructions that use hardware division, and then jump back into the tail end of the loop above to finish computing the remainder (etc).
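For reference, the trick the optimized code is relying on - multiplying by a scaled reciprocal instead of dividing - can be sketched in pure Python. The constants below are the classic 32-bit form of the transformation, not necessarily the exact ones MSVC chose:

```python
# Divide by 10 via a scaled integer reciprocal instead of hardware division:
# the multiplier m = ceil(2**35 / 10) = 0xCCCCCCCD fits in 32 bits, and
# (n * m) >> 35 == n // 10 for every 32-bit n.
M = (2**35 + 9) // 10          # ceiling division; equals 0xCCCCCCCD

def div10(n):
    """n // 10 for 0 <= n < 2**32, using only multiply-and-shift."""
    return (n * M) >> 35

assert M == 0xCCCCCCCD
assert all(div10(n) == n // 10 for n in range(10**5))
assert div10(2**32 - 1) == (2**32 - 1) // 10
```

The error introduced by rounding the reciprocal up is small enough (at most 2 parts in 2**35) that the truncated quotient is exact for every 32-bit numerator.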
So they not only optimized division by 10, they added a useless test and two branches to every iteration of the loop when we're not dividing by 10 ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/DGRMIVHOIAYQEQJD32ED4MCBOPFE5SIT/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Guido]
> I don't think there's a way to do a PGO build from Visual Studio; but
> a command prompt in the repo can do it using `PCbuild\build.bat --pgo`.
> Just be patient with it.

Thanks! That worked, and was easy, and gave me an executable that runs "// 10" at supernatural speed.

Alas, Visual Studio will not show a disassembly window unless the debugger is running, and there appears to be no way to convince it to run its debugger without it first recompiling the source file you're staring at. Which it recompiles without benefit of PGO. So, in the end, it was an elaborate way to reproduce the non-PGO-optimized machine code I already saw :-)
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/D47HHYYASRHS56AZL6SEK4Y7K5FNSJOP/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Tim, incidentally notes that passing 10 as the divisor to inplace_divrem1() is "impossibly fast" on Windows, consuming less than a third the time as when passing seemingly any other divisor]

[Mark Dickinson, discovers much the same is true under other, but not all, Linux-y builds, due to the generated code doing a runtime(!) check for 10 and using a scaled integer reciprocal multiplication trick, instead of HW division, in that specific case]

And there's a picture of this in the dictionary, next to the "dubious optimization" entry ;-)

I have to believe the same is true under Visual Studio 2019, but offhand don't know how to prove that. I understand Steve uses PGO to build the python.org Windows release, but I've never done that - the "Release build" configuration I get from Github does not use PGO, and the code I get from my own "release builds" is equally slow for all divisors (and the generated assembler is obviously not trying to special-case anything).

You later gave PGO the credit, which made sense from the start. But I _assumed_ PGO was latching on to // 10 because all sorts of programs use that all the time simply as a consequence of converting Python integers to strings. But that's not it. The base-conversion code doesn't call inplace_divrem1() - it does its own division loops with compile-time constants. So, as you already said ;-), it appears to be mostly an accident due to the specific tests we feed to PGO builds.

More interesting to me now is the "over 3x faster" empirical observation. The paper I referenced earlier has a cheaper way to use scaled integer reciprocals, requiring nothing worse than 32x32->64 int mult, and, to compute the fudged reciprocal, nothing worse than 64x32->32 int divide. OTOH, libdivide already has working code ready to use, and - unlike the paper I referenced - does not require "normalizing" the divisor (shifting left to ensure it's >= base/2).
libdivide needs more cycles to compute its 64-bit reciprocal, but in the context of bigints that's amortized over the number of digits in the numerator. A win is a win ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Gregory P. Smith ]
> ...
> That only appears true in default boring -O2 builds. Use
> `./configure --enable-optimizations` and the C version is *much* faster
> than your asm one...
>
> 250ns for C vs 370ns for your asm divl one using old gcc 9.3 on my
> zen3 when compiled using --enable-optimizations.

Something is missing here, but I can't guess what without seeing the generated machine code. But I trust Mark will do that.

> tested using ` -m timeit -n 150 -s 'x = 10**1000; r=x//10; assert r ==
> 10**999, r' 'x//10' `
>
> Use of __asm__ appears to have prevented the compiler from being able to fully
> optimize that code in PGO/FDO mode.
>
> I trust the compiler toolchain to know what's close to best here.

Mark is extremely experienced in numeric programming, and is the most capable numerical analyst contributing to CPython. That's why he trusts nothing ;-)

> ... there's probably interesting ways to optimize bignum division using
> opencl and vector hardware as well - for the case when you know you've
> got over a dozen digits; but that's what numpy et. al. are for.

numpy has no support for bigints, beyond allowing the use of PyObject* in numpy arrays (dtype='object'). In which case _all_ the bigint math is performed by the hosting Python implementation. People dead serious about bigint speed use the GMP library, which is a massive amount of code and 100% devoted to peak speed. In Python, gmpy2 supplies Python bindings for the GMP/MPIR, MPFR, and MPC libraries.

> Bignums in python are a convenience. Most values normal code deals
> with are less than 2**100.

There's very little that "most normal code" uses. Not bigints, not HTML, not regexps, not sockets, on & on & on. Whichever bubbles you live in are nevertheless bubbles ;-)

The speed of + - * // imposes fundamental limits on the speed of almost all bigint algorithms. A reasonably quick win for bigint // would be most welcome. Anyone dead serious about working with bigints in Python "should" install gmpy2.
But it's still fun to prototype work in plain old Python.

Note: I added an implementation of Karatsuba multiplication to CPython two decades ago (which gives a major O() improvement over "schoolbook" multiplication). I've often said that was a mistake, which I wouldn't make again. Because it added relative mountains of new code to longobject.c, and people who need fast bigint * should _still_ look to GMP anyway (which implements several additional, even more gonzo, * algorithms). But neither do I want to remove it. It works fine, and the speed boost is welcome (in my bubbles, multi-thousand bit ints are common).

What Mark is looking at here has a _very_ much more limited scope: the two (I think) places in longobject.c that are dividing two native-size machine ints in time-critical loops.

"But nobody divides big ints anyway - why would they?". For example, if you're investigating crypto schemes, modular exponentiation with multi-thousand bit ints can be common, and under the covers, CPython uses the same code for both // and %. One modular exponentiation has to do a * and % for each bit position in the exponent, plus more proportional to the number of bits set in the exponent.

About trusting "the toolchain", here under Windows on a capable-enough desktop box (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 3601 Mhz, 4 Core(s), 8 Logical Processor(s)), using the released 3.10.1:

$ python --version
Python 3.10.1
$ python -m timeit -n 150 -s "x = 10**1000" "x//10"
150 loops, best of 5: 376 nsec per loop

Which actually makes little sense to me. 10**1000 requires 111 CPython "digits", which is the number of times the loop in `inplace_divrem1()` has to go around. Under 4 nsec per iteration seems close to impossibly fast on a 3.8GHz box, given the presence of any division instruction. However, dividing by 10 is not a worst case on this box.
Dividing by 100 is over 3x slower:

$ python -m timeit -n 150 -s "x = 10**1000" "x//100"
150 loops, best of 5: 1.25 usec per loop

Dividing by the largest single Python "digit" is apparently the same:

$ python -m timeit -n 150 -s "x = 10**1000; d=2**30-1" "x//d"
150 loops, best of 5: 1.29 usec per loop

In fact, much the same story for dividing by 1, 2, 3, 4, 5, 6, 7, 8, 9, and 11. Got bored then ;-) They're _all_ much slower than dividing by 10 in this case.

Trust nothing :-)
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GLKIVXMJMVH35SFIR455O6SURNOTZCKD/ Code of Conduct: http://python.org/psf/codeofconduct/
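As a sanity check on the "111 digits" arithmetic above (a quick sketch, assuming a 30-bit-digit build):

```python
# With 30-bit internal digits, the number of CPython "digits" in a bigint
# is ceil(bit_length / 30).
x = 10**1000
bits = x.bit_length()          # 1000 * log2(10) rounded up
ndigits = -(-bits // 30)       # ceiling division via negated floor division
assert bits == 3322
assert ndigits == 111
```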
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Mark Dickinson]
>> Division may still be problematic.

Heh. I'm half convinced that heavy duty bigint packages are so often written in assembler because their authors are driven insane by trying to trick C compilers into generating "the obvious" machine instructions needed.

An alternative to HW division is to multiply by a scaled integer reciprocal instead, but it's very delicate to get all cases right. GMP heavyweights Niels Moller and Torbjorn Granlund wrote up the most efficient I've seen of this ilk in their paper "Improved division by invariant integers". It requires a single x single -> double multiply, and another multiply from which only the low bits are needed, to get the right quotient and remainder.

Ironically, it would run faster if CPython used all 32 bits of its internal digits - some operations in their algorithm are expected to overflow at times (including + and -!), and it's crucial that they be done modulo the base in use. That happens "by magic" if the base matches the unsigned type's width.

They only need a single-width scaled reciprocal, which can be computed with a HW double / single -> single divide if available, or via excruciating division-free scaled integer Newton refinement. Of course the cost of computing the reciprocal once pays off each time the same denominator is used. Curiously:

"""
It is curious that on these processors, the combination of our reciprocal algorithm (Alg. 2) and division algorithm (Alg. 4) is significantly faster than the built in assembler instruction for 2/1 division.
"""

HW division may have gotten faster since then, though.

> On that note: Python divisions are somewhat crippled even on x64. Assuming
> 30-bit digits, the basic building block that's needed for multi-precision division
> is a 64-bit-by-32-bit unsigned integer division, emitting a 32-bit quotient (and
> ideally also a 32-bit remainder).

Which is an instance of what they mean by "2/1 division" above.

> ...
> IOW, forcing use of DIVL instead of DIVQ, in combination with getting the remainder
> directly from the DIV instruction instead of computing it separately, gives a 41%
> speedup in this particular worst case. I'd expect the effect to be even more marked
> on x86, but haven't yet done those timings.
> ...
> I don't know whether we really want to open the door to using inline assembly
> for performance reasons in longobject.c, but it's interesting to see the effect.

Indeed it is! But changing the problem to use multiplication instead may be both faster and more portable, albeit at the cost of subtler code.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QGCHH6B7HCQRARPPVELVROO6FMWQY3G4/ Code of Conduct: http://python.org/psf/codeofconduct/
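For concreteness, here's a Python sketch of the digit-at-a-time loop being discussed: dividing a bigint (stored as a list of 30-bit digits, least significant first) by a single digit, where each step is exactly one of the "2/1" double-by-single divisions. This is my own toy, not longobject.c's actual code:

```python
SHIFT = 30  # bits per internal "digit", as in a 30-bit-digit build

def divrem1(digits, d):
    """Divide a bigint by single digit d; return (quotient digits, remainder).

    digits: least-significant digit first, each in range(2**SHIFT).
    Each iteration does one "2/1" division: a double-width accumulator
    (remainder shifted up, plus the next digit) by a single-width divisor.
    """
    quot = [0] * len(digits)
    rem = 0
    for i in reversed(range(len(digits))):
        acc = (rem << SHIFT) | digits[i]
        quot[i], rem = divmod(acc, d)
    return quot, rem

# Check against Python's own bigint division: value is 2**30 + 5.
q, r = divrem1([5, 1], 7)
value = 5 + (1 << SHIFT)
assert sum(dig << (SHIFT * i) for i, dig in enumerate(q)) == value // 7
assert r == value % 7
```

The hardware question in the thread is precisely how `divmod(acc, d)` gets compiled when `acc` is double-width and `d` single-width: DIVL delivers both quotient and remainder in one instruction, while generic code paths recompute the remainder with an extra multiply and subtract.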
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
>> The reason for digits being a multiple of 5 bits should be revisited vs
>> its original intent

> I added that. The only intent was to make it easier to implement
> bigint exponentiation easily ...

That said, I see the comments in longintrepr.h note a stronger constraint:

"""
the marshal code currently expects that PyLong_SHIFT is a multiple of 15
"""

But that's doubtless also shallow.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YSNDNGB2QUBS3ZX2JZNFZCDITRTDGBEM/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Is anyone using 15-bit PyLong digits (PYLONG_BITS_IN_DIGIT=15)?
[Gregory P. Smith ]
> The reason for digits being a multiple of 5 bits should be revisited vs
> its original intent

I added that. The only intent was to make it easy to implement bigint exponentiation while viewing the exponent as being in base 32 (so as to chew up 5 bits at a time). Since the number of digit bits _was_ 15 at the time, and it was "obvious enough" that 30 bits would work fine when 64-bit boxes became important enough to care about, it was easy to settle on "multiple of 5" as a quick, pragmatic tradeoff. The number controls how much space and time is needed to precompute a lookup table for a "large" (many bits) power - and powers can be very large indeed in the modular exponentiation case.

I doubt Python will move beyond 30-bit digits. While 64-bit support became increasingly and steadily more widespread (both SW and HW) in the 32-bit world, I see very much less enthusiasm to move beyond 64 bits. Yes, 64x64 -> 128 multiply is spreading, but that's about it. Some applications can make very good use of that, but, in the grand scheme of things, they remain niche. Whole universes of apps are getting real benefit by moving from 32 to 64 bits. That's needed just to get integers big enough to represent physical memory addresses in many consumer desktops now. But we'll likely never need to go beyond 64 bits for that.

In any case, "multiple of 5" is a shallow constraint, and getting rid of it shouldn't require more than rewriting longobject.c's long_pow()'s

    for (j = PyLong_SHIFT - 5; j >= 0; j -= 5) {

loop in a more convoluted way to cross digit boundaries when trying to suck up "the next" 5 bits. In fact, there are good reasons to make that loop more convoluted anyway, but I never got around to it (it "should" instead be aiming to find "the next" 5-bit slice that ends with a 1-bit, which would allow cutting the aforementioned precomputed table in half).
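To make the 5-bits-at-a-time scheme concrete, here's a simplified toy sketch of fixed 5-bit windowed modular exponentiation - not long_pow()'s actual code, which works digit by digit and handles many more cases:

```python
def pow_mod(base, exp, mod):
    """Left-to-right modular exponentiation, consuming exp 5 bits at a time."""
    # Precompute base**0 .. base**31 (mod mod): the 32-entry lookup table.
    table = [1] * 32
    for i in range(1, 32):
        table[i] = table[i - 1] * base % mod

    result = 1
    nbits = max(1, exp.bit_length())
    nbits += (-nbits) % 5                  # round up to a multiple of 5
    for shift in range(nbits - 5, -1, -5):
        for _ in range(5):                 # shift the accumulator up 5 bits
            result = result * result % mod
        result = result * table[(exp >> shift) & 31] % mod
    return result

assert pow_mod(3, 10**6 + 7, 997) == pow(3, 10**6 + 7, 997)
assert pow_mod(2, 33, 100) == pow(2, 33, 100)
```

The table's 32 entries are why the digit size "wants" to be a multiple of 5; the "slice ending in a 1-bit" refinement mentioned above would let the table drop its even-indexed half.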
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CP3EH5EWVDJKN63XAOMDCOL4OTWUF2JY/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Having Sorted Containers in stdlib?
I started writing up a SortedDict use case I have, but it's very elaborate and I expect it would just end with endless pointless argument about other approaches I _could_ take. But I already know all those ;-)

So let's look at something conceptually "dead easy" instead: priority queues. They're a basic building block in algorithms from many fields. In the distributed Python, `heapq` is the only thing that comes close to being reasonably efficient and scalable for that. However:

- It only supports min-heaps. It's not that max-heaps aren't also useful (indeed, _under the covers_ CPython also implements max-heaps for its own internal use). It's more that the original API simply doesn't generalize cleanly.

- So there's a body of word-of-mouth "tricks" for making a min-heap "act like" a max-heap. For integer or float priorities, it suffices to use the negation of "the real" priority. For other kinds of values, it gets trickier. In the end, you can wrap each item in a class that implements `x.__lt__(y)` by doing `y.value < x.value`. But few people appear to be aware of that, while nobody wants to endure the bother ;-)

- It's often the case that "the priority" needs to be computed from an item's value. `heapq` has no direct support for that. Instead there's another body of word-of-mouth tricks akin to the old decorate-sort-undecorate approach sorting used. Rather than store the values in the heap, you store `(priority, value)` tuples. But then various unpleasant things can happen if two priorities are tied: in that case tuple comparison falls back to comparing the values. But, e.g., value comparison may be very expensive, or the values may not support __lt__ at all. So instead you store `(priority, unique_integer, value)` triples. And hope that you don't really need a max-heap, lest it get even more obscure.

Using SortedContainers, none of those are issues. It's all straightforward and obvious.
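For reference, the two word-of-mouth heapq tricks above, spelled out (a stdlib-only sketch):

```python
import heapq
from itertools import count

# Trick 1: a max-heap via a wrapper whose __lt__ reverses the comparison.
class Reversed:
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        return other.value < self.value   # deliberately backwards

heap = [Reversed(v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]
heapq.heapify(heap)
assert heapq.heappop(heap).value == 9     # largest first
assert heapq.heappop(heap).value == 6

# Trick 2: computed priorities via (priority, unique_integer, value) triples;
# the unique integer breaks priority ties so the values themselves are never
# compared (they may be expensive to compare, or not support < at all).
tie = count()
heap = []
for value in ("bb", "a", "ccc"):          # priority = len(value)
    heapq.heappush(heap, (len(value), next(tie), value))
assert [heapq.heappop(heap)[2] for _ in range(3)] == ["a", "bb", "ccc"]
```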
If you have a list L always sorted by priority, extracting from a "min queue" just requires L.pop(0), or from a "max queue" L.pop(-1) (or just L.pop(): -1 is the default, same as for Python's list.pop()). No tricks are needed.

Fine too if sometimes you want to extract the min, while at other times the max. Or, e.g., peek at the 5 highest-priority values and pick the one that best fits resources available at the time.

In cases of computed priorities, the SortedKeyList constructor allows specifying a function to be used to compute the keys used to determine the list's ordering. Again straightforward and obvious:

>>> from sortedcontainers import SortedKeyList
>>> L = SortedKeyList([(i, j) for i in range(1, 5) for j in range(1, 5)],
...                   key=lambda t: t[0]/t[1])
>>> [t for t in L]
[(1, 4), (1, 3), (1, 2), (2, 4), (2, 3), (3, 4), (1, 1), (2, 2), (3, 3), (4, 4), (4, 3), (3, 2), (2, 1), (4, 2), (3, 1), (4, 1)]

That said, in most of my code I end up _removing_ uses of SortedContainers, in favor of faster ways of getting the job done. The package isn't the last resort for me, it's the first. I can write code in straightforward ways that scale well from the start. If I need more speed later, fine, I can puzzle out faster ways to get the task done. If you pride yourself on _starting_ by plotting ways to minimize the number of sorts you do, you're coding in C++, not Python ;-)

So I suggest people are staring at the wrong end when asking for use cases that can't be done without the package. Those are necessarily non-trivial. Having a sorted collection is "more than good enough" for many tasks. As above, e.g., there's no reason to imagine that Grant Jenks had priority queues in mind at all, yet the flavors of sorted lists he implemented are very easy to use for that task.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XXAXVV3KJ4ORQ7K6ELSS4R7S5725ASRE/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Having Sorted Containers in stdlib?
[Christopher Barker ]
> Earlier in the thread, we were pointed to multiple implementations.
>
> Is this particular one clearly the “best”[*]?
>
> If so, then sure.
>
> -CHB
>
> [*] best meaning “most appropriate for the stdlib”. A couple folks have
> already pointed to the quality of the code. But my understanding is that
> different algorithms are more or less appropriate for different use
> cases. So is this one fairly “universal”?

As noted earlier, SortedContainers is downloaded from PyPI hundreds of thousands of times every day, and that's been so for a long time. Best I can tell, no similar package is in the same universe as that.

The author identified over a dozen potential competitors here[1], but most didn't look like serious projects to him. He ran extensive timing tests against the rest, reported there too. His package usually wins, and is at worst competitive. It usually wins on memory burden too.

Some of these things are by people intrigued by "a data structure" they read about and implement for fun and self-education. That sometimes shines through in data-structure-specific APIs. While the SortedContainers extensive docs explain how things are implemented (if you look in the right spots for that), the API is remarkably agnostic (& useful). For example, this is a method of SortedList:

    irange(minimum=None, maximum=None, inclusive=(True, True), reverse=False)

which returns an iterator over a contiguous sorted range of values (neither, either, or both of whose endpoints may be included, depending on what you pass for `inclusive`). This "makes sense" regardless of how the structure may be implemented. Of course SortedSet and SortedDict support `irange()` too.
[1] http://www.grantjenks.com/docs/sortedcontainers/performance.html ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FD3DSXZUBDUC3PFQH2SS47NIZ36TTOL6/ Code of Conduct: http://python.org/psf/codeofconduct/
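The endpoint handling `irange()` promises can be pinned down with a stdlib-only sketch over a plain sorted list - just to illustrate the semantics; the real SortedList does this without materializing a slice:

```python
import bisect

def irange(sorted_list, minimum=None, maximum=None, inclusive=(True, True)):
    """Iterate values of an already-sorted list in [minimum, maximum],
    with either endpoint optionally excluded, as SortedList.irange() does."""
    if minimum is None:
        lo = 0
    elif inclusive[0]:
        lo = bisect.bisect_left(sorted_list, minimum)
    else:
        lo = bisect.bisect_right(sorted_list, minimum)
    if maximum is None:
        hi = len(sorted_list)
    elif inclusive[1]:
        hi = bisect.bisect_right(sorted_list, maximum)
    else:
        hi = bisect.bisect_left(sorted_list, maximum)
    return iter(sorted_list[lo:hi])

L = [2, 4, 6, 8, 10]
assert list(irange(L, 4, 8)) == [4, 6, 8]
assert list(irange(L, 4, 8, inclusive=(False, False))) == [6]
```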
[Python-Dev] Re: Having Sorted Containers in stdlib?
[Bob Fang ]
> This is a modest proposal to consider having sorted containers
> (http://www.grantjenks.com/docs/sortedcontainers/) in standard library.

+1 from me, but if and only if Grant Jenks (its author) wants that too. It's first-rate code in all respects, including that it's a fine example _of_ Python programming (it's written in Python, not C).

People often say "well, put a thing on PyPI first, and see whether people really want it!". SortedContainers has been on PyPI for years straight now, and still gets hundreds of thousands of downloads from there every day:

https://pypistats.org/packages/sortedcontainers

So it's already widely adopted in the community. What else would it need to prove? If it still doesn't qualify, "well, put a thing on PyPI first" is just obstructionism ;-)
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SE6YOVJQI6OEWNM5SAQL3X5VPEFGVZAA/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Having Sorted Containers in stdlib?
[Christopher Barker ] > Maybe a stupid question: > > What are use cases for sorted dicts? > > I don’t think I’ve ever needed one. For example, for some mappings with totally ordered keys, it can be useful to ask for the value associated with a key that's not actually there, because "close to the key is good enough". A SortedDict supports several ways of doing that efficiently, depending on what - exactly - an app means by "good enough". For example, it's easy to find _the_ closest key. Or the k closest keys. Or the two single keys <= and >=. Or ... Pick one; take some form of average of their values; use a domain-specific interpolation function on their keys & values; ... To make it concrete, suppose you're running a COVID-19 app where reports of new cases are coming in around the clock. One SortedDict may be used to map latitude to the number of confirmed cases reported. If a researcher asks for the number at a specific latitude, but no exact match is found, fine, you can still efficiently show them the results across the smallest known interval containing the latitude of interest. Or if they want the results broken into buckets of latitudes in 15-degree (N-degree - any N they ask for) intervals, also easy: enumerating all entries in a given key range also goes fast. Or ... A sorted list supports all sorts of membership queries efficiently. Under the covers, a SortedDict _is_ a SortedList, plus a regular Python dict to associate values with the list's elements. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FVDQZ4FJMSPYYTKNDRGLUXMOE226YIDK/ Code of Conduct: http://python.org/psf/codeofconduct/
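The "closest key" queries above reduce to bisection over the sorted key list. A minimal stdlib sketch (a hypothetical helper, not SortedDict's actual API):

```python
import bisect

def closest_key(sorted_keys, query):
    """Return the stored key nearest to query, given numeric keys in
    ascending order - 'close to the key is good enough' in miniature."""
    i = bisect.bisect_left(sorted_keys, query)
    # The nearest key is one of the neighbors just below and just above.
    candidates = sorted_keys[max(0, i - 1):i + 1]
    return min(candidates, key=lambda k: abs(k - query))

# E.g., latitudes with reported case counts: query a latitude not present.
latitudes = [10.0, 25.5, 40.0, 60.0]
assert closest_key(latitudes, 42.0) == 40.0
assert closest_key(latitudes, 5.0) == 10.0
```

A SortedDict keeps its keys in exactly this kind of sorted structure, so the same bisection (and the k-nearest and range-bucketing variants) all run in logarithmic time.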
[Python-Dev] Re: containment and the empty container
[Ethan Furman ]
> When is an empty container contained by a non-empty container?

That depends on how the non-empty container's type defines __contains__. The "stringish" types (str, bytes, bytearray) work _very_ differently from others (list, set, tuple) in this respect.

    t in x

for the latter answers whether t is contained in x, but in the former case answers whether t is a _sub_sequence (contiguous) of x. So, e.g.,

>>> "ab" in "cab"
True
>>> b"ab" in b"cab"
True
>>> bytearray(b"ab") in bytearray(b"cab")
True

while

>>> ['a', 'b'] in ['c', 'a', 'b']
False
>>> ('a', 'b') in ('c', 'a', 'b')
False

For every stringish type S, S() (the empty value of that type) is a subsequence of every value of type S, and `in` reflects that truth.

> Personally, I have never had a use for '' in 'some string' being True, and
> always have to guard against that scenario.

Don't know whether or not any of my code relies on it, but am pretty sure any relevant code would have already special-cased an empty string out of any code path that would have tried to apply `in` to it. Not particularly because it would be surprised by True, but because there's nothing useful it could _do_ with an empty string regardless (e.g., every slice of it would be equal to the original - an empty string is an end case for any string algorithm).

> ... on the other hand, it seems that collections of related flags are often treated
> as in set theory, where the empty set is a member of any non-empty set.

Not how any set theory I've ever seen works: a set S contains the empty set if and only if some member of S _is_ the empty set. Which is also how Python's frozenset.__contains__ works (a plain Python set can never contain the empty set, because a mutable set isn't hashable - only a frozenset can be an element of a set, empty or not).
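The frozenset point can be checked directly:

```python
# A frozenset is hashable, so it can be an element of a set or a frozenset;
# a mutable set is not hashable, so it can't be an element of either.
s = {frozenset()}
assert frozenset() in s
assert frozenset() in frozenset([frozenset()])

try:
    {set()}                    # can't put a mutable set inside a set
except TypeError:
    pass                       # unhashable type: 'set'
else:
    raise AssertionError("expected TypeError")
```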
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SSIRKW7TR5RUTOXLFWVGI7NAIZOHQKGH/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: The list.sort(reverse=True) method not consistent with description
[Raymond Bisdorff ]
> I fully agree with your point. By default, all the components of the
> tuple should be used in the comparison.
>
> Yet, I was confused by the following result.
> >>> from operator import itemgetter
> >>> L = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e')]
> >>> L.sort(key=itemgetter(0), reverse=True)
> >>> L = [(3, 'e'), (2, 'd'), (2, 'b'), (1, 'c'), (1, 'a')]
>
> Should the tuples comparison is in this case, I thought, not be solely
> based on the first tuple component?

I'm not sure what you're doing there. The last line in your example isn't showing the result of sorting, it's assigning a brand new list to `L`. Here's the same thing, but changing the last line to show the result of sorting:

>>> from operator import itemgetter
>>> L = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e')]
>>> L.sort(key=itemgetter(0), reverse=True)
>>> L
[(3, 'e'), (2, 'b'), (2, 'd'), (1, 'a'), (1, 'c')]

So stability does matter when comparing only the tuples' first components, and tuples with the same first component did retain their original order.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ATYT7LDUZ4W422CFRDQJB7KHKSAFM674/ Code of Conduct: http://python.org/psf/codeofconduct/
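The same point in runnable form: reverse=True reverses the key order, not the stability guarantee, so elements with equal keys keep their original relative order:

```python
# Stability survives reverse=True: the 2s and the 1s each keep the relative
# order they had in the input, while the key order as a whole is reversed.
data = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e')]
data.sort(key=lambda t: t[0], reverse=True)
assert data == [(3, 'e'), (2, 'b'), (2, 'd'), (1, 'a'), (1, 'c')]
```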
[Python-Dev] Re: The list.sort(reverse=True) method not consistent with description
[Raymond Bisdorff ] > ... > Please notice the following inconsistency in Python3.10.0 and before of > a sort(reverse=True) result: > > >>> L = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e')] > >>> L.sort(reverse=True) > >>> L > >>> [(3, 'e'), (2, 'd'), (2, 'b'), (1, 'c'), (1, 'a')] Looks good to me. > it should be: > > >>> L = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd'), (3, 'e')] > >>> reverseTuplesSort(L) > [(3, 'e'), (2, 'b'), (2, 'd'), (1, 'a'), (1, 'c')] Stability is irrelevant in the example, because no two list elements are equal. You appear to be thinking, perhaps, that s[0] == t[0] when s == (1, 'a') and t == (1, 'c') means s and t "are equal", but that's not so at all. >>> s = (1, 'a') >>> t = (1, 'c') >>> s == t False >>> s < t True >>> t > s True So s MUST come before t in a forward sort, and t MUST come before s in a reverse sort. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/V22SQQI4FC4XD4TNCCXWBLCOCH2W4XU3/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Packing a long list of numbers into memory
Sorry for the spam! A bunch of these were backed up in the moderation queue. I used the UI to set the list to auto-discard future messages from this address, but then clicked "Accept" in the mistaken sense of "yes, accept my request to auto-nuke this clown". But it took "Accept" to mean "sure thing, boss! I'll send this to everyone ASAP!!". Computer software. Everyone, please, get a useful job instead ;-) On Wed, Oct 20, 2021 at 10:43 PM wrote: > > that is great to see this post . https://bit.ly/3C551OO > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/5BMKWRYJVHTKLIW4MJM7J7SEYN2OXNJE/ > Code of Conduct: http://python.org/psf/codeofconduct/ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FZDBRF5D7ZBTXRUVJXPSBRFA2RRMI7LE/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: (Adaptive) Shivers sort variant of Tim sort
[Laurent Lyaudet ]
> ...
> My benchmarks could be improved but however I found that Shivers' sort
> and adaptive Shivers' sort (aka Jugé's sort) performs better than
> Tim's sort.

Cool! Could you move this to the issue report already open on this?

Replace list sorting merge_collapse()?
https://bugs.python.org/issue34561

Alas, that's been inactive for nearly 3 years, but we _were_ making some decent progress before everyone got distracted by "more important" stuff. You'll see there that Vincent Jugé contributed another version, designed to address that his sort did substantially worse than what we already have in some cases with very few runs. Which, for reasons explained there, is likely important. Before that, the best alternative known was - and by far - Munro & Wild's "powersort" merge strategy. That's deeply rooted in theory about fast approximations to optimal binary search trees. Where that left off, Vincent was going to try proving properties of his new variant, plus investigate other ideas. Alas, that's where it ended, as Vincent ran out of time too. In any case, yes, I'm entirely in favor of replacing timsort's merge strategy by something with provably good properties even in worst cases. As explained in the issue report, the merge strategy I made up was merely the first thing I tried that didn't obviously suck ;-) - almost all the work went into making merging itself as zippy as reasonably possible.

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CQ32BVLF66ITQ6NBSYCQUPB7TPAGD5AF/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Dropping out of this list
[me]
> If you want more active moderation, volunteer for the job. I'd happily
> give it up, and acknowledge that my laissez-faire moderation approach
> is out of style.

But, please, don't tell _me_ off-list that you volunteer. I want no say in who would become a new moderator - I'm already doing the job as I believe it's _best_ done. "Hands off" as much as possible. So if the community disagrees, they need to work out among themselves "the rules" for what should be suppressed, and pick moderators they believe will enforce those rules. I'm old, and cut my online discussion teeth via Usenet. It was total chaos, and I came to appreciate that deeply. Nobody was silenced by a central authority, but then again nobody could insist on being heard either. News readers quickly grew rather sophisticated notions of "killfiles", which allowed a user to quickly render posters they found useless, and/or toxic threads, and/or ... (anything that could be identified by a web of user-supplied regular expressions) completely invisible to them. The unsolvable problem I see with trying to ask a moderator to do that for everyone at once is that, to stop complaints about over-permissive moderation, the moderator would have to emulate the union of all list members' killfiles (had they ability to make their own), not their intersection. And then essentially no messages would be approved. You think I'm joking, but, e.g., back in the day I knew someone who killfiled every poster who had a surname that "sounded Armenian". And you better believe that _some_ people today would insist that _this_ message be suppressed because "killfile" evokes violence.
Me, I'd consider suppressing it only because I generally find very little value in meta-posts, which this post _about_ python-dev moderation is ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OER42TVZSUC5YG6TT7TWI263UC7SYU4N/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Dropping out of this list
Various variations on: > ... I am also considering unsubscribing if someone doesn't step in and stop > the mess going on between Brett and Marco. ... Overall, "me too!" pile-ons _are_ "the [bulk of the] mess" to most list subscribers. It will die out on its own in time. Dr. Brett should know by now that he'll never get the "last word" with Dr. Marco. And everyone should know by now that Dr. Marco won't let anyone else get the last word either ;-) So stop feeding it. That's the practical thing to do. Like almost everything else in CPython-land, list moderation is a volunteer thing. I've been the most active (by far) list moderator here essentially forever, but I see my only _legitimate_ job as stopping spam. I might step in to block blatantly illegal posts too, but that hasn't yet come up. If you want more active moderation, volunteer for the job. I'd happily give it up, and acknowledge that my laissez-faire moderation approach is out of style. Else suit yourself: unsubscribe if you like, and come back whenever you like. It's all self-serve, and you're welcome here whenever "here" is welcome to you. As a matter of courtesy, though, it's less disruptive to the list membership if you just unsubscribe rather than announce to the 3,585 list members that you're considering unsubscribing ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZK5YV5SPYCGKGE3UX2BTRFAGMGRKIJIS/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Problems with dict subclassing performance
[Marco Sulla ]
> I repeat, even the worst AI will understand from the context what I
> meant.

Amazingly enough, the truth value of a proposition does not increase via repetition ;-)

>>> bool(True * 1_000_000_000)
True
>>> bool(False * 1_000_000_000)
False

> But let me do a very rude example:
> What if I said to Steven "I pretend immediate suck my $%#$"? Do you
> think you and the others will not understand the sense? :D

Yes! But, for me, solely because of the "suck my $%#$" part. I still have NO idea what the "pretend immediate" part means to you. I see Chris spelled out, in some detail, what "pretend" means to native English speakers. None of which make sense in this context either, unless you're saying you're going to "make believe" that someone is going to "immediate suck [your] $%#$". But, in that case, what you choose to fantasize about doesn't really seem relevant either. "Pretend" just doesn't make any more sense here than, say, "hippopotamus" would.

> C'Mon, you are offending my poor intelligence.
>
> As I wanted to say, I pretend from Steven his excuses for his
> insinuation, immediately. Is this clear now,

Not to me, no. Although I bet the new phrasing "insinuation" gets much closer to your intent than "excuses". You think Steven was indirectly accusing you of unethical behavior (trolling for StackOverflow upvotes)? That's not the sense I got from his original reply, but I can understand it if you did. If that's your complaint, I'll leave it to Steven to say what his intent was - and, if appropriate, to apologize for unintended offense.

> or must I take an English course at Cambridge?

This isn't about advanced English usage. It's about the ordinary meanings of ordinary words in (what should be!) simple contexts. If I said to you

Capisco con il pesce il nodo insolito!
I doubt you'd suggest I study Italian at the University of Bologna ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4LWHML3I2UWU3Y2K6ZJ7RMP352JUDSTK/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Problems with dict subclassing performance
[Marco Sulla ]
> It's the Netiquette, Chris. It's older than Internet. It's a gross
> violation of the Netiquette remarking grammatical or syntactical
> errors. I think that also the least advanced AI will understand what I
> meant.

As multiple people have said now, including me, they had no idea what you meant by "I pretend your immediate excuses". It's not a complaint that it's expressed inelegantly, but that they can't make _any_ sense of it. By my count, this is at least the second time you've declined to explain what you meant, but instead implied the person who said they couldn't understand it was being dishonest.

> I think anyway that now the sense of my request is __very clear__. I
> ask the intervention of a moderator. How can I do this? Thank you in
> advance.

You already did, and I already replied. I don't view myself as a parent, referee, or judge here, and have never suppressed anyone for anything they said. I expect people to work out their differences among themselves, like adults. If you want to pursue this, I can only suggest bringing up a Code of Conduct complaint. You do that like so: https://www.python.org/psf/conduct/ """ If you believe that someone is violating the code of conduct, or have any other concerns, please contact a member of the Python Software Foundation Code of Conduct work group immediately. They can be reached by emailing conduct...@python.org """ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ND7WVOPRQUTXBR3K6XSAJ3C2YL23WVCR/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Problems with dict subclassing performance
[Marco Sulla ] > Oh, this is enough. The sense of the phrase was very clear and you all > have understood it. Sincerely, I have no idea what "I pretend your immediate excuses." means, in or out of context. > Remarking grammatical errors is a gross violation > of the Netiquette. I ask __immediately__ the intervent of a moderator, I'm the only active moderator on this list, and I expect everyone has noticed I don't exercise those powers at all except to try to keep spam from showing up to begin with. I had no problem at all with your original post, but, as usual, _personally_ find almost no value in meta-posts (posts _about_ posts). People make them anyway, and that's OK by me. They eventually burn themselves out. I would _like_ it most if everyone dropped this thread now, unless they want to say something about the original topic (dict subclass performance). But, towards that end, I'm not going to threaten anyone with any moderation action. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/T5Z35KJLY3RQGS2HXYHLTR3QFURO34RF/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Windows buildbots may be broken
Sorry, all! This post was pure spam - I clicked the wrong button on the moderator UI. The list has already been set to auto-reject any future posts from this member. On Mon, Aug 9, 2021 at 10:51 AM ridhimaortiz--- via Python-Dev wrote: > > It is really nice post. https://bit.ly/3fsxwwl > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/WMJOVXKRH2ILPADU5QMLIVEIXHWPBOLZ/ > Code of Conduct: http://python.org/psf/codeofconduct/ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/INKI3S6ZZTQDVXZ3Y5BIJJO6QMGQJ5EK/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: [slightly OT] cryptographically strong random.SystemRandom()
[Ethan Furman]
> A question [1] has arisen about the viability of `random.SystemRandom` in
> Pythons before and after the secrets module was introduced
> (3.5 I think) -- specifically
>
> does it give independent and uniform discrete distribution for
> cryptographic purposes across CPython 3.x versions?
>
> ...
> [1] https://stackoverflow.com/q/68319071/208880

`secrets` is just a wrapper around `random.SystemRandom`, so the presence or absence of `secrets` doesn't matter. As to SystemRandom, all answers depend on the quality of the platform os.urandom(), which Python has no control over. See my answer here, and the comments on it:

https://stackoverflow.com/questions/20936993/how-can-i-create-a-random-number-that-is-cryptographically-secure-in-python/20937265#20937265

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7TZWAZCNFC4ZMRQENWQ5DDIFU6QPGN4Z/ Code of Conduct: http://python.org/psf/codeofconduct/
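[The "just a wrapper" relationship is visible from pure Python. A small sketch - note that `secrets._sysrand` is a private implementation detail of CPython's secrets.py and could change, so the first assertion is illustrative only:]

```python
import random
import secrets

# secrets' module-level functions delegate to a random.SystemRandom
# instance, so both draw from the same os.urandom() entropy source.
# (_sysrand is private to secrets.py - shown only to make the point.)
assert isinstance(secrets._sysrand, random.SystemRandom)

sysrand = random.SystemRandom()
token = secrets.token_hex(16)    # 32 hex digits, from os.urandom()
n = sysrand.randrange(10**6)     # same entropy source, older API

assert len(token) == 32
assert 0 <= n < 10**6
```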
[Python-Dev] Re: Why list.sort() uses mergesort and not timsort?
[Dan Stromberg ]
> ...
> Timsort added the innovation of making mergesort in-place, plus a little
> (though already common) O(n^2) sorting for small sublists.

Actually, both were already very common in mergesorts. "timsort" is much more a work of engineering than of insight ;-) That is, it combined many pre-existing ideas, but pushed them hard, and got them to work smoothly together. The most novel part is "galloping" to vastly cut the number of comparisons needed in many real-world cases of partially ordered inputs. I took that idea from (as briefly explained in listsort.txt) papers seeking to speed real-life unions and intersections of sorted lists, presumably in database implementations. I later discovered that the idea had already been applied to a mergesort, as noted in a paper by McIlroy (cited in listsort.txt) - but the paper didn't give any details, and best I can tell he never published his code. Without that, timsort wouldn't have been worth the bother of writing - it was generally no faster than Python's prior sort implementation on randomly-ordered inputs, and is quite a large pile of code. But many cases of real-world inputs do have significant pre-existing order of some kind, where even a "perfect" quicksort (always picks _the_ median as the partition pivot) remains O(n log n) but timsort O(n). Curiously, though, it was the only implementation ever tried of a mergesort-based algorithm that wasn't notably _slower_ on randomly-ordered inputs than Python's prior "samplesort" algorithm (like quicksort but with a very high-quality guess for the median element based on recursively sample-sorting a largish random sample).
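[The "galloping" idea mentioned above is, at its core, exponential search: probe at exponentially growing offsets to bracket the answer, then binary-search the small remaining interval - cheap when the answer is near the front. A hedged sketch of that core idea only (not CPython's gallop_left(), which also gallops relative to a hint index):]

```python
from bisect import bisect_left

def gallop_left(x, a):
    """Leftmost index at which x could be inserted into sorted list 'a',
    found by galloping: probe offsets 1, 3, 7, 15, ... until we overshoot,
    then binary-search only the last bracketed interval."""
    if not a or a[0] >= x:
        return 0
    lo, hi = 0, 1                   # invariant: a[lo] < x
    while hi < len(a) and a[hi] < x:
        lo, hi = hi, 2 * hi + 1     # gallop: exponentially growing probes
    hi = min(hi, len(a))
    # Answer lies in (lo, hi]; finish with an ordinary binary search.
    return bisect_left(a, x, lo + 1, hi)

# Agrees with bisect_left, but reaches nearby answers in O(log answer)
# comparisons rather than O(log len(a)).
a = list(range(100))
for x in (-1, 0, 1, 5, 50, 99, 150):
    assert gallop_left(x, a) == bisect_left(a, x)
```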
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Y22TRSA5I4JCNC6GRBX7QZYF4MLGPCYK/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Anyone else gotten bizarre personal replies to mailing list posts?
FYI, I just force-unsubscribed this member (Hoi Lam Poon) from python-dev. Normally I don't do things like that, since, e.g., we have no way to know whether the sender address was spoofed in emails we get. But in this case Hoi's name has come up several times as the sender of individual spam, and if Hoi is actually a legit member they should have noticed and spoken up. Of course this can't stop them from spamming. Just intended to increase the effort for them, if they're determined to continue. [nothing new below] On Tue, May 4, 2021 at 11:34 AM Jonathan Goble wrote: > > I just got one now from the same person on the dependabot thread. Entirely in > Chinese except for "GNU license" in the middle, along with an attached jpeg > of a random SciPy-related tweet (according to Gmail's preview thumbnail; I > didn't actually open the attachment for obvious reasons). > > On Sun, May 2, 2021, 5:08 AM Jeff Allen wrote: >> >> Yes, I got one from the same address today. Thanks for pointing out these >> are individual peformances: it was annoying when I thought it was spam to >> the list. >> >> Although Hoi Lam Poon is a real (female) name, it may signify a generated >> lampoon. >> >> Jeff Allen >> >> On 23/04/2021 16:38, Nathaniel Smith wrote: >> >> I just got the reply below sent directly to my personal account, and I'm >> confused about what's going on. If it's just a one off I'll chalk it up to >> random internet weirdness, but if other folks are getting these too it might >> be something the list admins should look into? Or... something?
>> >> -- Forwarded message - >> From: Hoi lam Poon >> >> ___ >> Python-Dev mailing list -- python-dev@python.org >> To unsubscribe send an email to python-dev-le...@python.org >> https://mail.python.org/mailman3/lists/python-dev.python.org/ >> Message archived at >> https://mail.python.org/archives/list/python-dev@python.org/message/3EA53I3Y72ACEPDVG467NMNTXHRL3NXL/ >> Code of Conduct: http://python.org/psf/codeofconduct/ > > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/JNYRFZ4FTIFUJFS7FPNMKI2HXW7SLL52/ > Code of Conduct: http://python.org/psf/codeofconduct/ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LFULZM6GLDUAS5SFQ4F454QMRGEH3EGH/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: On the migration from master to main
I'm guessing it's time to fiddle local CPython clones to account for master->main renaming now? If so, I've seen two blobs of instructions, which are very similar but not identical:

Blob 1 ("origin"):

"""
You just need to update your local clone after the branch name changes. From the local clone of the repository on a computer, run the following commands to update the name of the default branch.

$ git branch -m master main
$ git fetch origin
$ git branch -u origin/main main

Apart from that, you should update any local script or command that uses the name "master" to use the name "main".
"""

Blob 2 ("upstream"):

"""
The CPython repository's default branch was renamed from ``master`` to ``main`` after the Python 3.10b1 release. If you had cloned the repository before this change, you can rename your local branch as follows::

git branch -m master main
git fetch upstream
git branch -u upstream/main main
"""

From my dim understanding, "upstream" makes more sense, but I don't know. On my box:

"""
$ git remote -v
origin   https://github.com/tim-one/cpython (fetch)
origin   https://github.com/tim-one/cpython (push)
upstream https://github.com/python/cpython (fetch)
upstream https://github.com/python/cpython (push)
"""

so syncing with the universally shared "upstream" just makes more sense to me. Right? Wrong? Do we need some mix of both? Something else? I'd rather get it right the first time than try to untangle a mess after the fact ;-)

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/X3V42DYOHFLGAUPODFLXMYZWDE3VY47W/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PyGC and PyObject_Malloc introspection
[Julien Danjou] > ... > Supposedly PyObject_Malloc() returns some memory space to store a > PyObject. If that was true all the time, that would allow anyone to > introspect the allocated memory and understand why it's being used. > > Unfortunately, this is not the case. Objects whose types are tracked by > the GC go through _PyObject_GC_Alloc() which changes the underlying > memory structure to be (PyGC_HEAD + PyObject). > > This is a bummer as there then no safe way that I can think of to know > if an allocated memory space is gc-tracked or gc-untracked. It makes it > therefore impossible to introspect the memory allocated by > PyObject_Malloc(). I'm not clear on exactly what it is you're after, but CPython faces the same question all the time: _given_ a pointer to an object, is there, or is there not, a GC header prepended? That's answered by this C API function: """ int PyObject_IS_GC(PyObject *obj) Returns non-zero if the object implements the garbage collector protocol, otherwise returns 0. The object cannot be tracked by the garbage collector if this function returns 0. """ FYI, the implementation usually resolves it by looking at whether obj's type object has the Py_TPFLAGS_HAVE_GC flag set. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LBYGKGEPB5ZAFEQGOBIICNEDRYK4PIT3/ Code of Conduct: http://python.org/psf/codeofconduct/
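[From pure Python, the same distinction can be poked at per object with `gc.is_tracked()` - note it reports whether the collector *currently* tracks an object, which can differ from "supports the GC protocol" because of type-specific untracking optimizations. A small sketch:]

```python
import gc

# Atomic objects are never tracked by the cyclic GC, so their allocations
# have no PyGC_HEAD prepended.
assert not gc.is_tracked(42)
assert not gc.is_tracked("hello")

# Container objects generally are tracked.
assert gc.is_tracked([1, 2, 3])
assert gc.is_tracked({"a": []})   # dict holding a container is tracked
```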
[Python-Dev] Re: Changing Python's string search algorithms
[Tim] > ... > Alas, the higher preprocessing costs leave the current PR slower in "too > many" cases too, especially when the needle is short and found early > in the haystack. Then any preprocessing cost approaches a pure waste > of time. But that was this morning. Since then, Dennis changed the PR to back off to the current code when the needle is "too small". There are very few cases we know of now where the PR code is slower at all than the current code, none where it's dramatically slower, many where it's significantly faster, and some non-contrived cases where it's dramatically faster (for example, over a factor of 10 in stringbench.py's "late match, 100 characters" forward-search tests, and especially beneficial for Unicode (as opposed to bytes)). Then there are the pathological cases like in the original issue report, where it's multiple orders of magnitude faster (about 3 1/2 hours to under a tenth of a second in the issue report's case). Still waiting for someone who thinks string search speed is critical in their real app to give it a try. In the absence of that, I endorse merging this. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5K2ISKORAH327WLQ43QG5QMPWXNHMFTC/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Tim]
>> Note that no "extra" storage is needed to exploit this. No character
>> lookups, no extra expenses in time or space of any kind. Just "if we
>> mismatch on the k'th try, we can jump ahead k positions".

[Antoine Pitrou ]
> Ok, so that means that on a N-character haystack, it'll always do at
> least N character comparisons?
>
> IIRC, the current algorithm is able to do much less than that in
> favorable cases. If the needle is k characters long and they are all
> distinct, it's able to do N/k comparisons.

It's an excellent question, and from what I said it's a correct inference. But I only described how the algorithm matches the "v" part of the "needle = u + v" splitting. When matching the "u" part, skips materially exceeding the count of comparisons made are possible.

The paper claims the minimal number of comparisons needed (ignoring preprocessing) is 2N/k, so same order as the current algorithm. But, for the constant factor, Dennis's code achieves N/k, because he augmented the Crochemore-Perrin algorithm with a variant of Sunday's algorithm (a simplification, but not as extreme as the current code's simplification). Note that N/k is a known lower bound for the best case - impossible to do materially better.

In the 'x' * 100 + 'y' example I ended my msg with, recall that v was just "y", so the part I described was of minimum value - if the last haystack character in the current search window isn't "y", the mismatch occurred on the first try, which leads to a shift of just 1. But if the character one beyond the end of the current search window isn't "x" or "y", then Sunday's algorithm allows skipping 102 positions (len(needle) + 1). His algorithm requires precomputing a skip-count table of O(len(sigma)) space, where sigma is the universe of possible characters ("the alphabet"). That's "a whole lot" for Unicode.
Fredrik and Dennis simplified Sunday's algorithm to weaker versions that require relatively small constant space instead, independent of needle and alphabet size. Despite the current code's simplification of that nearly to the point of non-existence (it reduces the space needed to 32 bits, and with only one possible skip count), it's nevertheless so effective that it beat Dennis's initially pure Crochemore-Perrin code by dramatic margins in a significant number of exceptionally lucky (for the current code) cases. After adding a variation of Sunday (the current PR uses 256 bytes, and has 64K possible skip counts - but perhaps other tradeoffs would work better), we haven't yet seen a case (contrived or natural) where the current code is dramatically faster. But we have seen non-contrived cases where the current PR is dramatically faster, including in the benchmark code Fredrik gifted us with (Tools/stringbench/stringbench.py in your Python tree). Alas, the higher preprocessing costs leave the current PR slower in "too many" cases too, especially when the needle is short and found early in the haystack. Then any preprocessing cost approaches a pure waste of time. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MHGNAZH7X4HS2KSWTRDVSX2AJXN7POXZ/ Code of Conduct: http://python.org/psf/codeofconduct/
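[Sunday's skip heuristic itself is compact enough to sketch: on a failed window, look at the haystack character just *beyond* the window; slide so the needle's last occurrence of that character lines up under it, or past it entirely (len(needle) + 1) if it doesn't occur in the needle. A simplified full-table sketch in Python - not the PR's compressed 256-byte variant:]

```python
def sunday_find(haystack, needle):
    """Naive Sunday search: lowest index of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # shift[c]: distance to slide so the *last* occurrence of c in the
    # needle lines up under the probe character; absent chars give m + 1.
    shift = {c: m - i for i, c in enumerate(needle)}
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        if i + m == n:          # no probe character beyond this window
            return -1
        i += shift.get(haystack[i + m], m + 1)
    return -1

assert sunday_find("hello world", "world") == 6
assert sunday_find("aaaa", "ab") == -1
# The 'x' * 100 + 'y' example: a probe char absent from the needle
# would allow a jump of len(needle) + 1 positions.
assert sunday_find("x" * 100 + "y", "xy") == 99
```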
[Python-Dev] Re: Changing Python's string search algorithms
[Tim Peters, explains one of the new algorithm's surprisingly effective moving parts] [Chris Angelico ] > Thank you, great explanation. Can this be added to the source code > if/when this algorithm gets implemented? No ;-) While I enjoy trying to make hard things clear(er), I need to understand them inside out myself first, and I'm still not there with this algorithm. Even for what I do grok now, I cheated some to make that post simple. For example, if you read it carefully, you'll realize that the explanation I gave doesn't actually explain why the "x" * 100 example is correct: the explanation implicitly relied on the left half of the splitting (u) being at least as long as the right half (v). But that's not always the case (and, in particular, for "x" * 100, u is empty!). Tal Einat posted earlier that he was keen to try to work up a clear explanation, and I look forward to that. All the expositions I've found of this algorithm so far are hard to approach. The original paper is still the best I've seen. Everything I wrote was but a gloss on its Lemma 3.2. > I've learned all kinds of things from the huge block comments > about timsort, list growth, and the like. They make great reading > when you're trying to figure out how to explain things (or if > you're a bored nerd browsing for curiosity's sake - that works too!). Takes one to know one ;-) There's beauty and elegance in this algorithm, but they'll go unloved without explanations clear enough for the non-obsessed to follow. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HMSOQD22CLHLZ4VEWV6FPRA2RC3TM75H/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
I don't plan on making a series of these posts, just this one, to give people _some_ insight into why the new algorithm gets systematic benefits the current algorithm can't. It splits the needle into two pieces, u and v, very carefully selected by subtle linear-time needle preprocessing (and it's quite a chore to prove that a suitable splitting always exists!):

needle == u + v

There are a whole bunch of things that can be said about this splitting (and must be said to prove worst-case linear time), but here's just one for which it's easy to show the benefit: no (non-empty) prefix of v is a suffix of u.

Now I think it's quite easy to twist your head into knots trying to see what follows from that, so I'll just give a highly generalizable concrete example. Suppose v starts with "xyza". Then u cannot end with "x", "xy", "xyz", or "xyza". If u did, then u would have a suffix that's also a prefix of v.

A match attempt at a given haystack position starts by matching the characters in v one at a time, left to right. Here with haystack on the first line, needle on the second, and "0" denoting "don't care / unknown / doesn't exist" - it's just _some_ glyph so the example has some chance of lining up under annoying proportional fonts :-(

Say the first character matches:

0x00
0xyzavvv

And the second and third:

0xyz
0xyzavvv

But the 4th doesn't:

0xyzZ000
0xyzavvv

Is there any possibility of a match if we shift the needle one to the right? No! If we tried, the last character of u would line up with the "x" in the haystack, and we already know "x" is not a suffix of u. How about if we shifted two to the right? No again! And for the same reason: then the last two characters of u would line up with "xy" in the haystack, but we already know "xy" is not a suffix of u. And so on. In general, if we find a mismatch while trying the k'th (1-based) character of v, we can shift the needle k places to the right before there's any possibility of matching.
In this case, k=4, so we can jump to:

    0xyzA000
        0xyzavvv

Note that no "extra" storage is needed to exploit this. No character lookups, no extra expenses in time or space of any kind. Just "if we mismatch on the k'th try, we can jump ahead k positions". Nothing special about "xyza" either - this works with any characters at all! "xyza" isn't by any stretch a hard case regardless. But it goes _some_ way toward hinting at why potentially hard cases are rendered toothless too. For example, the needle "x" * 100 is split into

    u = ""  # yup, empty!
    v = needle

Now, e.g., if we find a mismatch at the 83'rd character of v, we can leap ahead 83 positions immediately. What about needle "x" * 100 + "y"? I picked that to crush your premature optimism ;-) Now the splitting is

    u = "x" * 100
    v = "y"

and the wonderful trick above only applies to a trivial 1-character string. The "hard part" is all in matching the u piece now, which I'll leave as an exercise for the reader ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HPMYTQSJ4IXUYUCZE2EYCIF34RTQDW7O/ Code of Conduct: http://python.org/psf/codeofconduct/
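As a quick sanity check, the "no non-empty prefix of v is a suffix of u" property described above can be verified in a few lines. This is a toy brute-force checker only - the real preprocessing finds such a split in linear time, which this quadratic sketch makes no attempt at:

```python
def is_valid_split(u: str, v: str) -> bool:
    """True iff no non-empty prefix of v is also a suffix of u."""
    return all(not u.endswith(v[:k])
               for k in range(1, min(len(u), len(v)) + 1))

# The two splittings from the text:
print(is_valid_split("", "x" * 100))     # u = "", v = needle: valid
print(is_valid_split("x" * 100, "y"))    # needle "x"*100 + "y": valid
# A split that violates the property: "x" is a suffix of u AND a prefix of v
print(is_valid_split("abx", "xyza"))
```

The first two calls print True, the last False, matching the "u cannot end with any prefix of v" rule the shifting argument relies on.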
[Python-Dev] Re: Changing Python's string search algorithms
[Marco Sulla] > Excuse me if I intrude in an algorithm that I have not understood, but > the new optimization can be applied to regexps too? The algorithm is limited to searching for fixed strings. However, _part_ of our regexp implementation (the bit that looks ahead for a fixed string) will inherit it by magic. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Z4R6VUAOXMH2D5NDGWZX52BLU6F7XNXE/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Guido] > I am not able to dream up any hard cases -- like other posters, > my own use of substring search is usually looking for a short > string in a relatively short piece of text. I doubt even the current > optimizations matter to my uses. I should have responded to this part differently. What I said was fine ;-) , but it's a mistake to focus exclusively on pathologies here. What Fredrik did was with the aim of significantly speeding utterly ordinary searches. For example, search for "xyz" in "abcdefgh...xyz". Brute force starts comparing at the first location:

    abcdefgh...
    xyz

The current code compares "c" with "z" first. They don't match. Now what? It looks at the _next_ character in the haystack, "d". Thanks to preprocessing the needle, it knows that "d" appears nowhere in the needle (actually, the code is so keen on "tiny constant extra space" that preprocessing only saves enough info to get a _probabilistic_ guess about whether "d" is in the needle, but one that's always correct when "not in the needle" is its guess). For that reason, there's no possible match when starting the attempt at _any_ position that leaves "d" overlapping with the needle. So it can immediately skip even trying starting at text indices 1, 2, or 3. It can jump immediately to try next at index 4:

    abcdefgh...
    0123xyz

Then the same kind of thing, seeing that "g" and "z" don't match, and that "h" isn't in the needle. And so on, jumping 4 positions at a time until finally hitting "xyz" at the end of the haystack. The new code gets similar kinds of benefits in _many_ similarly ordinary searches too, but they're harder to explain because they're not based on intuitive tricks explicitly designed to speed ordinary cases. They're more happy consequences of making pathological cases impossible. From randomized tests so far, it's already clear that the new code is finding more of this nature to exploit in non-pathological cases than the current code.
Although that's partly (but only partly) a consequence of Dennis augmenting the new algorithm with a more powerful version of the specific trick explained above (which is an extreme simplification of Daniel Sunday's algorithm, which was also aimed at speeding ordinary searches). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Y3DFFBHNMHDGRE2GIEMH7XLY5YR6BMKR/ Code of Conduct: http://python.org/psf/codeofconduct/
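The skip trick described above - a stripped-down cousin of Sunday's algorithm - can be sketched in pure Python. This is an illustration of the idea only, not CPython's actual C implementation, which uses a lossy 32-bit bitmask rather than an exact set:

```python
def sunday_find(haystack: str, needle: str) -> int:
    """Brute force plus the 'character just past the window' skip."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    in_needle = set(needle)   # exact set; the real code uses a probabilistic bitmask
    i = 0
    while i <= n - m:
        if haystack.startswith(needle, i):
            return i
        nxt = i + m           # character just past the current window
        if nxt < n and haystack[nxt] not in in_needle:
            # No alignment overlapping haystack[nxt] can possibly match,
            # so skip all starting positions up to and including nxt.
            i = nxt + 1
        else:
            i += 1
    return -1

# The example from the text: "d" (and later "h") isn't in the needle,
# so the search leaps 4 positions at a time.
print(sunday_find("abcdefghxyz", "xyz"))   # 8
```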
[Python-Dev] Re: Changing Python's string search algorithms
[Dennis Sweeney ] > Here's my attempt at some heuristic motivation: Thanks, Dennis! It helps. One gloss: > > The key insight though is that the worst strings are still > "periodic enough", and if we have two different patterns going on, > then we can intentionally split them apart. The amazing (to me) thing is that splitting into JUST two parts is always enough to guarantee linearity. What if there are a million different patterns going on? Doesn't matter! I assume this remarkable outcome is a consequence of the Critical Factorization Theorem. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/IKPKLR43U522PC55JB7GQZMQSGJHNCKF/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Guido] > The key seems to be: Except none of that quoted text (which I'll skip repeating) gives the slightest clue as to _why_ it may be an improvement. So you split the needle into two pieces. So what? What's the _point_? Why would someone even imagine that might help? Why is one half then searched right to left, but the other half left to right? There's actually "a reason" for searching the right half left to right, but because the shift on a mismatch in the left half is a constant ("per(x)") independent of the mismatching position, it's actually irrelevant in which order the left half's characters are compared (well, at least in some variations of these newer algorithms). Over-specification can be actively misleading too. What's the key idea? "Split the needle into two pieces" is a key _step_, not a key _insight_. > I need a Python version though. Dennis wrote one for his first stab, which you can download from the bug report (it's attached there as fastsearch.py). On the bug report's data, running that under CPython 3.9.0 on my box reduced the time from about 3 1/2 hours to about 20 seconds. Running it under PyPy, to under one second (and the current C implementation under 0.1 second). > I am not able to dream up any hard cases -- like other posters, my own > use of substring search is usually looking for a short string in a relatively > short piece of text. I doubt even the current optimizations matter to my uses. They probably do, but not dramatically. Would you, e.g., notice a factor of 2 either way in those cases? A factor of 4? Of 10? When Fredrik first wrote this code, he routinely saw factors of 2 to 10 speedup on all sorts of string-searching benchmarks, both contrived and "natural". The speedups were especially pronounced for Unicode strings at the time, where skipping futile Unicode character comparisons could be more valuable than when skipping (shorter) single-byte comparisons.
Should also note that the fixed string searching code is also used as a subroutine by parts of our regexp implementation, by str.count(), str.replace(), similar `bytes` operations, and so on. > What are typical hard cases used for? It's kinda like asking what typical hard rounding cases for pow() are used for ;-) They aren't deliberate. They "just happen", typically when a pattern and the text to search contain self-similarities "but with glitches". Search the string

    text = "X" * 10_000_000

for

    needle = "X" * 1_000_000

Easy! It matches at once. But now tack "Y" on to the end of the needle. Then it's impossible to match. Brute force search first finds a prefix match at text[0 : 1_000_000], but then fails to match the trailing "Y". So brute force uses another million compares to match the prefix at text[1 : 1_000_001]. But again fails to match the trailing "Y". Then another million plus 1 compares to fail to match starting at index 2, and again at index 3, and again at index 4, ... it approaches a trillion comparisons before it finally fails. The current code starts comparing at the right end of each trial position first. Then an "X" from the text and the needle's trailing "Y" mismatch at once. That's a million times faster. Here's a simple case where the current code is horribly slow, because neither "right-to-left" nor any of its preprocessing tricks do any good at all. Even Daniel Sunday's full-blown algorithm[1] (which the current code strips down _almost_ to the point of non-existence) wouldn't help (unless it used a variant I've never actually seen that compared the rarest pattern character first):

    ("X" * 10_000_000).find("X" * 500_000 + "Y" + "X" * 500_000)

The newer code returns -1 seemingly instantly. > DNA search? (That would be cool!) It can certainly help. Self-similarities are bound to be common in long strings from a 4-character alphabet (ACGT).
But, to be fair, serious work with genomes typically invests in building a giant suffix tree (or enhanced suffix array) for each cataloged sequence to be searched. No preprocessing of needles is needed then to guarantee worst-case efficient searches of many kinds (including things like finding the longest adjacent repeated subsequences, more like a regexp (.+)\1 kind of search). But the new code could quite plausibly speed more casual prototyping of such explorations. [1] https://dl.acm.org/doi/abs/10.1145/79173.79184 ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XUP6S2IVJAW5K2NSHZ7UOKN5YQFNUWVQ/ Code of Conduct: http://python.org/psf/codeofconduct/
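The near-quadratic blowup described above is easy to reproduce at small scale with a comparison-counting brute-force search (illustrative code only - this is not CPython's implementation):

```python
def brute_find(haystack: str, needle: str):
    """Naive left-to-right search: returns (index or -1, comparisons made)."""
    comps = 0
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        for j in range(m):
            comps += 1
            if haystack[i + j] != needle[j]:
                break
        else:
            return i, comps
    return -1, comps

# Scaled-down versions of the text's example (1000 / 100 instead of 10M / 1M):
print(brute_find("X" * 1000, "X" * 100))        # (0, 100): matches immediately
print(brute_find("X" * 1000, "X" * 100 + "Y"))  # (-1, 90900): ~n*m comparisons
```

Tacking one "Y" onto the needle turns 100 comparisons into 900 * 101 = 90,900; at the text's full 10M/1M scale the same effect approaches a trillion.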
[Python-Dev] Re: Changing Python's string search algorithms
[Steven D'Aprano ] > Perhaps this is a silly suggestion, but could we offer this as an > external function in the stdlib rather than a string method? > > Leave it up to the user to decide whether or not their data best suits > the find method or the new search function. It sounds like we can offer > some rough heuristics, but the only way to really know is "try it and > see which works best for you". > > The `string` module is an obvious place for it. I think this is premature. There is almost never an optimization that's a pure win in all cases. For example, on some platforms `timsort` will never be as fast as the old samplesort in cases with a very large number of equal elements, and on all platforms `timsort` consumes more memory. And it's a wash on "random" data on most platforms. Nevertheless, it was a significant speed win for many - possibly even most - real-life cases. So far, the PR reduces the runtime in the bug report from about 3 1/2 hours to under a tenth of a second. It would be crazy not to gift users such dramatic relief by default, unless there are good reasons not to. Too soon to say. On tests of random strings with characters following a Zipf distribution (more realistic than uniform but still very easy to generate - and not contrived in any way to favor either approach), the current new code is usually faster than the status quo, in no case has been worse than twice as slow, and in a number of cases has been 20x faster. It's already clearly a win _overall_. The bulk of the "but it's slower" cases now are found with very short needles (patterns), which was expected (read my comments on the bug report), and exacerbated by the fact that the structure of the random test generation is quite likely to create cases where a short needle is found early in the haystack. This combination makes the higher preprocessing overhead as damaging as possible.
Dennis also expanded the current code's "32-bit bitmask" trick (which is an extreme simplification of Daniel Sunday's algorithm) to a fixed-size 256-byte table of character-class-specific skip counts, which went a _very_ long way toward eliminating the cases where the current code enjoyed a "pure luck" advantage over the new code. But that increased the preprocessing overhead again, which again is relatively most significant for short needles - those 256 bytes have to be initialized every time, no matter the lengths of the needle or haystack. If the new code were changed to exempt short needles, even now it might be hard to make an objective case not to adopt it. But it would leave open a different idea for an "opt-in" facility: one that allowed users to give a needle to a function and get back an object capturing the results of preprocessing. A different wrapper around the search code could then accept that already-preprocessed info. There are, e.g., certainly cases where repeated searches for a given 4-character needle could be sped significantly by exploiting the new code, but where the overhead of preprocessing is too much to bear in "random" performance testing. It would also open the door to more aggressive (expensive) kinds of preprocessing. That's where users may be able to make useful, informed judgments. [David Mertz] > That said, I do recognize that the `re` module also has pathological cases, > and the standard library documentation explicitly says "maybe you want to > try the third-party `regex` implementation." That is sort of precedent for > this approach. `regex` also suffers exponential time disasters, they're just harder to stumble into - and there are books explaining in detail how and why regex engines can fall into these traps. It's far less _expected_ in plain string searches, and Dennis was aware of the new algorithms because, apparently, (at least) glibc's memmem switched to one some years ago. So we're playing catch-up here.
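The "opt-in" facility sketched above might look something like the following. All names here are invented for illustration (nothing like this exists in the stdlib), and the simple character-set table stands in for whatever richer preprocessing the real search would cache - the point is only that the preprocessing cost is paid once and reused across many searches:

```python
class PreprocessedNeedle:
    """Hypothetical 'preprocess once, search many times' wrapper."""

    def __init__(self, needle: str):
        self.needle = needle
        self.chars = frozenset(needle)   # stand-in for the real, richer tables

    def find(self, haystack: str) -> int:
        n, m = len(haystack), len(self.needle)
        i = 0
        while i <= n - m:
            if haystack.startswith(self.needle, i):
                return i
            nxt = i + m
            if nxt < n and haystack[nxt] not in self.chars:
                i = nxt + 1   # Sunday-style skip via the precomputed set
            else:
                i += 1
        return -1

pat = PreprocessedNeedle("xyz")   # preprocessing cost paid exactly once
print([pat.find(s) for s in ("abxyzc", "no match here", "xyz")])
```

Repeated calls to `pat.find()` then never re-scan the needle, which is exactly where a short needle searched many times could win.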
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ECPXBBF7OUNYLDURCUKYXIOTGPGVBMHX/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
[Guido] > Maybe someone reading this can finish the Wikipedia page on > Two-Way Search? The code example trails off with a function with > some incomprehensible remarks and then a TODO.. Yes, the Wikipedia page is worse than useless in its current state, although some of the references it lists are helpful. This is a much better summary: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 but, I believe, still far too telegraphic. The original paper is by far the best account I've seen so far, with complete code and exhaustive explanations and proofs. Even examples ;-) But here's the thing: I don't believe this algorithm _can_ be reduced to an elevator pitch. Excruciating details appear to be fundamental at every step, and I've seen nothing yet anywhere approaching an "intuitive explanation" :-( What happens instead is that you run it on the hardest cases you can dream up, and your jaw drops in amazement :-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/B2MPSYXRKFA3V3GP45GAFSIKKCK5NHJ3/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Changing Python's string search algorithms
Rest assured that Dennis is aware that pragmatics may change for shorter needles. The code has always made a special case of 1-character needles, because it's impossible "even in theory" to improve over straightforward brute force search then. Say the length of the text to search is `t`, and the length of the pattern `p`. Brute force and the current code have worst case O(t*p) behavior. The new code, worst case O(t+p). If that's as deep as you dig into it, it seems all but obvious then that O(t*p) just can't be that bad when p is small, so keep it simple. But there's another twist: the current and new code both have O(t/p) cases too (but brute force, and even Knuth-Morris-Pratt, don't). That can be highly beneficial even for p as small as 3. Unfortunately, the exact cases in which the current and new code enjoy O(t/p) behavior aren't the same. Lying a bit: In general the current code has just two tricks. One of those tricks is useless (pure waste of time) if the pattern happens to end with a repeated character pair, and is most effective if the last character appears nowhere else in the pattern. The other trick is most effective if the set of characters in the text has no intersection with the set of characters in the pattern (note that the search is certain to fail then), but useless if the set of text characters is a subset of the set of pattern characters (as, e.g., it very often is in real-life searches in apps with a small alphabet, like [ACGT]+ for genomes). But I don't know how to characterize when the new code gets maximum benefit. It's not based on intuitive tricks. The paper that introduced it[1] says it's based on "a deep theorem on words known as the Critical Factorization Theorem due to Cesari, Duval, Vincent, and Lothaire", and I still don't fully understand why it works.
It's clear enough, though, that the new code gets real systematic value out of patterns with repetitions (like "abab"), where the current code gets real value from those only by accident (e.g., "a" and "b" happen to appear rarely in the text being searched, in which case that the pattern has repetitions is irrelevant). But, as I said in the first message, the PR is "preliminary". There are still worlds of tweaks that have been thought of but not yet tried even once. [1] search for "Two-Way String Matching" by Crochemore and Perrin. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/G53VXXYWWEM26S2XKVX5W6P54R47YQTG/ Code of Conduct: http://python.org/psf/codeofconduct/
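The O(t/p) behavior discussed in this message - leaping a full pattern length at a time - can be seen concretely with a comparison-counting Horspool-style search. This is illustrative only; neither the current nor the new CPython code is exactly this algorithm:

```python
def horspool_find(haystack: str, needle: str):
    """Boyer-Moore-Horspool with a counter: returns (index or -1, comparisons)."""
    n, m = len(haystack), len(needle)
    # Bad-character shift table, built from all but the last needle character;
    # characters absent from the needle get the maximal shift m.
    skip = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    comps = 0
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0:             # compare right to left within the window
            comps += 1
            if haystack[i + j] != needle[j]:
                break
            j -= 1
        else:
            return i, comps
        i += skip.get(haystack[i + m - 1], m)
    return -1, comps

# The needle's last character appears nowhere in the text: one compare per
# window, each mismatch shifting by the full length m=10 - about t/p compares.
print(horspool_find("A" * 1000, "X" * 9 + "C"))   # (-1, 100)
```

This is the "last character appears nowhere else" sweet spot described above; when the text's characters are all drawn from the pattern's alphabet, the shifts collapse and the advantage evaporates.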
[Python-Dev] Changing Python's string search algorithms
Fredrik Lundh crafted our current string search algorithms, and they've served us very well. They're nearly always as fast as dumbest-possible brute force search, and sometimes much faster. This was bought with some very cheap one-pass preprocessing of the pattern (the substring to search _for_), to compute a single integer "skip", and a single 32-bit bitmask, which can sometimes be used to skip over areas where it can prove matches are doomed to fail. Unlike most "fancier than brute" algorithms, it does not build preprocessing tables of size proportional to the length of the pattern, or to the size of the alphabet. The extra space is O(1), and tiny, independent of pattern and alphabet size. A weakness is that worst-case search time is proportional to the product of the lengths of the inputs. This can be so for searches that succeed as well as for those that fail. This was recently dramatically illustrated by this bug report: https://bugs.python.org/issue41972 where a single large string search consumed well over 3 hours of CPU time before it succeeded. It "should take" a fraction of a second. While it was news to me, Dennis Sweeney was aware that other algorithms, still requiring only O(1) extra space but with worst-case linear time, are now well known. That's really quite remarkable :-) They do require fancier (but still linear-time) preprocessing, so can't always be a pure win. So this message: string (which includes byte) search is an extremely common operation, across all kinds of applications, so changes here need as many eyeballs on them as they can get. What's typical? What's important? What doesn't matter? What about reverse searches ("rfind")? How can we quantify the acceptable region of tradeoffs? My guess is that searches for substrings less than 50 characters long in strings under ten thousand characters are most important for most apps, but I just pulled that guess out of my butt.
I'm also guessing that searches for multi-million-character substrings (like in the bug report above) are rare indeed, and that any change sparing them from quadratic-time disaster would be "way more than good enough". But I don't know. If you can help, Dennis already has a preliminary PR implementing one of the newer algorithms, augmented with Fredrik's unique ideas: https://github.com/python/cpython/pull/22679 By "help" I don't necessarily mean code review - it could, e.g., be a huge help to test-drive it on your own critical string-slinging apps. Faster? Slower? A wash? Wrong answer? Crash? Due to the natures of the old and new algorithms, neither will be faster or slower in all cases, the newer one should never be dramatically slower than the old one, the new one will be spectacularly faster in some cases, and "in general" it seems impossible to guess in advance. It depends on the value the fancier preprocessing can extract from the pattern versus the higher preprocessing cost, and that in turn depends on details of exactly what the patterns and texts to be searched are. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ECAZN35JCEE67ZVYHALRXDP4FILGR53Y/ Code of Conduct: http://python.org/psf/codeofconduct/
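The "single 32-bit bitmask" preprocessing mentioned at the top of this message behaves like a tiny Bloom filter over the pattern's characters. A Python sketch of the idea (the exact hash used here is an assumption for illustration; the real code does the equivalent with C bit operations during its one preprocessing pass):

```python
def make_mask(needle: str) -> int:
    """Hash every needle character into one of 32 bits."""
    mask = 0
    for ch in needle:
        mask |= 1 << (ord(ch) & 31)
    return mask

def maybe_in_needle(ch: str, mask: int) -> bool:
    """False is always right ('definitely absent'); True may be a collision."""
    return bool(mask & (1 << (ord(ch) & 31)))

mask = make_mask("xyz")
print(maybe_in_needle("x", mask))   # True
print(maybe_in_needle("d", mask))   # False - the search may safely skip ahead
```

A "False" answer lets the search prove a region can't match and leap over it; an occasional false "True" merely forfeits a skip, never correctness.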
[Python-Dev] Re: Deferred, coalescing, and other very recent reference counting optimization
I'm surprised nobody has mentioned this: there are no "unboxed" types in CPython - in effect, every object user code creates is allocated from the heap. Even, e.g., integers and floats. So even non-contrived code can create garbage at a ferocious rate. For example, think about this simple function, which naively computes the Euclidean distance between two n-vectors:

```python
def dist(xs, ys):
    from math import sqrt
    if len(xs) != len(ys):
        raise ValueError("inputs must have same length")
    return sqrt(sum((x - y)**2 for x, y in zip(xs, ys)))
```

In general, `len(xs)` and `len(ys)` each create a new integer object, which both become trash the instant `!=` completes. Say the common length is `n`. `zip` then creates `n` 2-tuple objects, each of which lives only long enough to be unpacked into `x` and `y`. Then the result of `x - y` lives only long enough to be squared, and the result of that lives only long enough to be added into `sum()`'s internal accumulator. Finally, the grand total lives only long enough to extract its square root. With "immediate" reclamation of garbage via refcounting, memory use is trivial regardless of how large `n` is, as CPython reuses the same heap space over & over & over, ASAP. The space for each 2-tuple is reclaimed before `x-y` is computed, the space for that is reclaimed when the square completes, and the space for the square is reclaimed right after `sum()` folds it in. It's memory-efficient and cache-friendly "in the small". Of course that's _assuming_, e.g., that `(x-y).__pow__(2)` doesn't save away its arguments somewhere that outlives the method call, but the compiler has no way to know whether it does. The only thing it can assume about the element types is that they support the methods the code invokes. In fact, the compiler has no idea whether the result type of `x-y` is even the same as the type of `x` or of `y`.
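The "immediate reclamation" claim is easy to observe directly: a weakref callback fires the moment the refcount hits zero, with no garbage-collector pause involved (CPython-specific behavior - other implementations may defer reclamation):

```python
import weakref

class Vec(list):
    """list subclass, only because plain lists can't be weakly referenced."""

events = []
v = Vec([1.0, 2.0, 3.0])
r = weakref.ref(v, lambda ref: events.append("reclaimed"))

del v                  # refcount drops to zero...
print(events)          # ['reclaimed'] - freed synchronously, no GC cycle needed
print(r() is None)     # True: the object is already gone
```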
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NU4T5TFPDVYDCR5ADY6KKJ6USWVFD3TZ/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622 version 2 (Structural Pattern Matching)
One microscopic point: [Guido] > ... > (if `.x` is unacceptable, it’s unclear why `^x` would be any > better), As Python's self-appointed spokesperson for the elderly, there's one very clear difference: a leading "." is - literally - one microscopic point, all but invisible. A leading caret is far easier to see, on a variety of devices and using a variety of fonts. Indeed, I missed the leading dot in ".x" in your email the first two times I read that sentence. But a caret is harder to type. So here's an off-the-wall idea: use an ellipsis. If you're still using a maximal-munch lexer, ellipsis followed by an identifier is currently a syntax error. "...x" is far easier to see than ".x", easier to type than "^x", and retains the mnemonic connection that "something is a named load pattern if and only if it has dots". "..." is also a mnemonic for "OK, here I want to match ... umm ... let me think ... I know! A fixed value." ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TPTECZKGQY53L7ZKIQOVOXOTMQE6A447/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Recent PEP-8 change
[Paul Moore ] > (This is a genuine question, and I'm terrified of being yelled at for > asking it, which gives an idea of the way this thread has gone - but I > genuinely do want to know, to try to improve my own writing). > > What *is* the correct inclusive way to refer to an unidentified person > in a technical document, without sacrificing clarity by using > convoluted circumlocutions like "he/her/they" or over-use of the > passive voice? My impression is that commonly accepted language rules > and usage are lagging behind, and there's no good answer to this > question yet :-( I've used singular "they" for years; if someone truly objects to that, they've kept it to themself. "Authorities" are moving in that direction too; e.g., https://apastyle.apa.org/style-grammar-guidelines/grammar/singular-they """ The singular “they” is a generic third-person singular pronoun in English. Use of the singular “they” is endorsed as part of APA Style because it is inclusive of all people and helps writers avoid making assumptions about gender. Although usage of the singular “they” was once discouraged in academic writing, many advocacy groups and publishers have accepted and endorsed it, including Merriam-Webster’s Dictionary. """ Then again, we're talking about humans. There's nothing you can do - or refrain from doing - that won't mortally offend someone :-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4MOBYZJUEMUKKI2GELZNXTRMT7QPRGR3/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Please refrain from posting on the PEP 8 threads for 24 hours
[Victor Stinner ] > If someone continues to feed the PEP 8 discussion, would it be > possible to change their account to require moderation for 1 day or > maybe even up to 1 week? I know that Mailman 3 makes it possible. I see no such capability. I could, for example, manually fiddle things so that messages from your email address are held for moderation, but that would persist until I manually undid it again. But there seem to be many things Mailman 3 is theoretically capable of that aren't available via the list owner web UI. That's the only admin access I have (or want). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FSIQ36ZYGUMPCYXUOABBV6TCDVU2H467/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Please refrain from posting on the PEP 8 threads for 24 hours
[Brett Cannon wrote:] > Regardless of what side you fall on, I think we can agree that > emotions are running very high at the moment. Nothing is going > to change in at least the next 24 hours, so I am personally > asking folks to step back for at least that long and think about: > > Is what you want to say going to contribute to the discussion? > Is it being phrased in a polite, constructive fashion? > > If after 24 hours of reflection you feel your email still meets these > criteria then I would say it's reasonable to send that email. As python-dev's primary moderator, I endorse and second Brett's request. No, I'm not slapping blanket moderation on the list unless forced to. This is just a sensible adult-to-adult request to cool down and restore at least a shallow illusion ;-) of mutual civility. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HB22ZQECC5DJPIOFXHHDUAMGD6YN6WFO/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Lists placed into Emergency Moderation status
[Ernest W. Durbin III ] >>> Reviewing, I may have misinterpreted the message from PSF Executive >>> Director regarding the situation. >>> >>> It does appear that python-ideas moderators contacted postmaster@. >>> Appears I misread a message saying “it looks like it’s happening on >>> python-dev too” to mean that the request was for both lists. [Tim Peters] >> It depends on who you want to annoy least ;-) >> >> If it's the position of the PSF that some kind(s) of messages must be >> suppressed, then I'll need a more-or-less clear definition, from them, >> of what those consist of. At which point I'll agree to enforce them, >> or step down as a python-dev moderator. [Thomas Wouters ] > Just to clarify: this had nothing to do with the PSF, but with the > Steering Council My understanding is that the "PSF Executive Director" Ernest cited has nothing to do with the Steering Council, but with the PSF ;-) In any case, that's why I said "if it's the position of the PSF ...". > and the python-ideas moderators. We were discussing the PEP-8 "controversy" > last night, after Titus (one of the python-ideas moderators) asked about > putting *that* list in full moderation, and we talked about how the discussion had bled > over into python-dev. We weren't clear among ourselves which lists we were > talking about anymore when we asked Ernest. So it wasn't the PSF Executive Director. Regardless, same thing in the end to me: if someone who claims authority to do this wants to enforce some content suppression rules on python-dev, that's a business I'm very reluctant to get into, but can't judge without a clearer definition of what's out of bounds. I do OK with "spam" and "wholly off topic", but Python people telling each other their attitudes, beliefs, and/or behaviors are disgusting, unacceptable, evil ... strikes me as being a healthy (if unpleasant) family fight. I won't take a side on that in the moderation role.
> to see about emergency moderation mode (which wasn't as easy as we > would've expected), To judge from the change I undid myself for python-dev, it appeared to amount to no more than putting ".*" in a regexp field to match all Subject lines. (To be clear, I undid this for python-dev; Ernest did not, so blame me - as the night wore on, I wanted to kill the change before I went to sleep, so people posting overnight (from my POV) wouldn't have to wait hours & hours for posts to go through.) > which is why Ernest did it for both lists. We should've been more careful > about this, and looped in the python-dev moderators straight away. Sorry > about that. Not a real problem! Just another minor computer-related busy-work annoyance, in a life overflowing with such for decade after decade after decade ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AAXZP3GD6PWW36FNQLHUD545NM7W6MX4/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Lists placed into Emergency Moderation status
[Ernest W. Durbin III ] > Reviewing, I may have misinterpreted the message from PSF Executive > Director regarding the situation. > > It does appear that python-ideas moderators contacted postmaster@. > Appears I misread a message saying “it looks like it’s happening on > python-dev too” to mean that the request was for both lists. > > Should I disable moderation on python-dev? It depends on who you want to annoy least ;-) If it's the position of the PSF that some kind(s) of messages must be suppressed, then I'll need a more-or-less clear definition, from them, of what those consist of. At which point I'll agree to enforce them, or step down as a python-dev moderator. Otherwise, as I said, I've seen nothing on python-dev so far that I would have rejected. So all "emergency moderation status" is doing for python-dev so far is annoying me with more email to go approve messages waiting in the queue, and to annoy those messages' senders with delays (waiting for moderator action). So it's not, in reality, accomplishing anything of value here. So, if I were you, I'd disable emergency moderation status on python-dev. But then I'm old, retired, and notoriously pig-headed ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LZVR6SLZN4LCGD4SC4BCVHKMUH6GKA6D/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Lists placed into Emergency Moderation status
[Ernest W. Durbin III ] > At the request of the list moderators of python-ideas and python-dev, > both lists have been placed into emergency moderation mode. All > new posts must be approved before landing on the list. > > When directed by the list moderators, this moderation will be disabled. I do at least 99% of python-dev moderation actions, and this was news to me. I'm not going to reject anything Python-related based on whether I (or anyone else) agree(s) with its political spin. If that's what's wanted, please contact whichever python-dev moderator(s) did request this, and tell them the job of actually doing python-dev moderation is all theirs until this passes. Otherwise anything Python-related short of pure abuse will be approved. Seriously. I'm not going to reject sincere "you're a racist!" posts or "I know you are, but what am I?" posts. But I don't even know whether that's what this is about - all I've seen is some moderately heated discussion, none of which I would have rejected. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PYRZRPDJRVVJVICEC3N4YLR2XG2A7SEC/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
[Tim] See reply to Glenn. Can you give an example of a dotted name that is not a constant value pattern? An example of a non-dotted name that is? If you can't do either (and I cannot), then that's simply what "if and only if" means. [Rhodri James ] >>> case long.chain.of.attributes: [Tim] >> That's a dotted name and so is a constant value pattern - read the PEP. >> >> Every dotted name in a pattern is looked up using normal Python >> name resolution rules, and the value is used for comparison by >> equality with the matching expression (same as for literals). [Rhodri] > Then I am surprised, which is worse. "long.chain.of.attributes" looks > like an assignment target, and I would have expected the case to have > been a name pattern. As always, I don't care whether something is obvious at first glance. I care whether something can be learned with reasonable effort, and "sticks" _after_ it's learned. There's essentially nothing truly obvious about programming. This, from the PEP, is the entire grammar for a "name pattern": name_pattern: NAME !('.' | '(' | '=') That's it. A plain name not followed by a dot, left paren, or equals sign. While it may or may not surprise any given person at first glance, it's very simple. Put a fraction of the effort into learning it as you're willing to expend on articulating surprise, and it would already be far behind you ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HQHVWUA7TSDOOGJTK4CO32CS3TRHEMW6/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
[Rhodri James ] > I'm seriously going to maintain that I will forget the meaning of "case > _:" quickly and regularly, Actually, you won't - trust me ;-) > just as I quickly and regularly forget to use > "|" instead of "+" for set union. More accurately, I will quickly and > regularly forget that in this one place, "_" is special. Because that's the opposite of "accurate". There's nothing special about "_" "in this one place". It's but a single application of the rule that "_" is used as a wildcard in _all_ matching contexts throughout the PEP. And it's not even new with this PEP. "_" is routinely used already in lots of code to mean "the syntax requires a binding target here, but I don't care about the binding", from lists = [[] for _ in range(100)] to first, _, third = triple The last is especially relevant, because that's already a form of destructuring. The only thing new about this use of "_" in the PEP is that it specifies no binding will occur. Binding does occur in the examples above (because there's nothing AT ALL special about "_" now - it's just a one-character identifier, and all the rest is convention, including that the REPL uses it to store the value of the last-displayed object). >> See reply to Glenn. Can you give an example of a dotted name that is >> not a constant value pattern? An example of a non-dotted name that is? >> If you can't do either (and I cannot), then that's simply what "if and only if" means. >case long.chain.of.attributes: That's a dotted name and so is a constant value pattern - read the PEP. Every dotted name in a pattern is looked up using normal Python name resolution rules, and the value is used for comparison by equality with the matching expression (same as for literals). > or more likely > >case (foo.x, foo.y) Ditto. > for the first. For the second, it's a no-brainer that you can't have a > non-dotted name as a constant value pattern, since the current constant > value pattern mandates a leading dot. Not so. 
_Some_ dot is necessary and sufficient to identify a constant value pattern now. A leading dot is only _required_ in case an intended constant value pattern would have no dots otherwise. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VDDYNQO7JOEZ2ENSHWIJAYBGXAHLBVLI/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
[Taine Zhao ] > "or" brings an intuition of the execution order of pattern matching, just > like how people already know about "short-circuiting". > > "or" 's operator precedence also suggests the syntax of OR patterns. > > As we have "|" as an existing operator, it seems that there might be > cases that the precedence of "|" is not consistent with it in an > expression. This will mislead users. > > You said "All reuse of symbols carries baggage", I'd say, > > All **inconsistent** reuse of symbols carries baggage, but the > consistent reuse builds good intuitive sense and shows the good > taste of designers. We're not talking about abstract computation here: this is a specific feature, and "|" is the _only_ infix operator. The PEP considered and rejected "&" and a unary "not", so that's the universe we're left with. With only one "operator", it's really hard to "mislead" ;-) In any case, the model here is far more regular expressions than Python int arithmetic or set unions. "|" means essentially the same thing in the PEP as it does in Python regexps: try subpatterns one at a time, left to right. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LH6WUBPBAEPSOAMSB37ZWA2Z2G7X47AB/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
[Ethan Furman ] > "case _:" is easy to miss -- I missed it several times reading through the > PEP. As I said, I don't care about "shallow first impressions". I care about how a thing hangs together _after_ climbing its learning curve - which in this case is about a nanometer tall ;-) You're not seriously going to maintain that you're struggling to grasp the meaning of "case _:" now, right? > Huh. I would consider "case _:" to be the wart, especially since "case default:" > or "case anything:" or "case i_dont_care:" all do basically the same thing > (although they bind to the given name, Having climbed the trivial learning curve, only a deliberate wise ass would code garbage like "case i_dont_care:". I don't care about them either. The one obvious way to do it has already been made clear to them. You may as well, e.g., complain that there's nothing to stop a wise ass from writing "-5+6" where everyone else writes "+1". > while _ does not bind to anything, but of what practical importance is that?) > . One obvious way to do it is of major practical importance. > ... > Besides which, if we use "|" instead of "or" then we can't later allow more > general expressions. Good! The PEP is quite complicated enough already. But if you want to pursue this seriously, you're going to have your work cut out for you to explain why "|" is more sacred than "or" with respect to "more general expressions". If you don't want to preclude anything, then you need to invent syntax that has no meaning at all now. >> ".NAME" grated at first, but extends the idea that dotted names are >> always constant value patterns to "if and only if". So it has mnemonic >> value. > How do you get from "." to "iff" ? See reply to Glenn. Can you give an example of a dotted name that is not a constant value pattern? An example of a non-dotted name that is? If you can't do either (and I cannot), then that's simply what "if and only if" means. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YGTDPCASYUQYCMU5PS2S5YO7DBNDYYWP/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
[Tim] >> ".NAME" grated at first, but extends the idea that dotted names are >> always constant value patterns to "if and only if". So it has mnemonic >> value. When context alone can't distinguish whether a name is meant as >> (in effect) an lvalue or an rvalue, no syntax decorations can prevent >> coding errors. Names in destructuring constructs are overwhelmingly >> intended as lvalues, so adding extra cruft to say "no, I meant rvalue" >> is the pragmatic choice. [Glenn Linderman ] > This is just a bunch of words to me, without meaning. > > I'd like to understand it. Did you read the PEP? > What do you mean by "the idea that dotted names are always constant > value patterns"? Under the PEP's "Constant Value Pattern" section: Every dotted name in a pattern is looked up using normal Python name resolution rules, and the value is used for comparison by equality with the matching expression (same as for literals). That's what I meant. "Contains a dot" implies "constant value pattern". > What do you mean by 'extends (the above) to "if and only if" '? Because the next sentence from the PEP: As a special case to avoid ambiguity with name patterns, simple names must be prefixed with a dot to be considered a reference: completes turning "contains a dot" into a necessary and sufficient ("if and only if") condition for distinguishing a constant value pattern from a name pattern. Where "constant value pattern" and "name pattern" are again used with the PEP's meanings. > As a result of not understanding the above, I see no mnemonic value. While I do. "If I want `xyz` to be interpreted as a constant value pattern, it needs to contain a dot: `.xyz` should do it. If I want `enums.HI` to be interpreted as a constant value, it already contains a dot, so it will be." > My understanding of the "." as proposed is that it is optional, except > in cases where it would be ambiguous... 
seems like it would be better if > it were required for one case or the other, so that there would be no > need to determine whether or not it is ambiguous by the surrounding > state/code/declarations. A dot is required, when and only when you want the chunk of syntax to be interpreted as a constant value pattern. I said nothing at all about "_leading_" dots, which appear to be all you have in mind there. _Some_ dot is mandatory to make it a constant value pattern; a _leading_ dot may or may not be required. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YEUQNW5NUAW2OGMRJN22G3JMSIB4PA4J/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: PEP 622: Structural Pattern Matching
You got everything right the first time ;-) The PEP is an extended illustration of "although that way may not be obvious at first unless you're Dutch". I too thought "why not else:?" at first. But "case _:" covers it in the one obvious way after grasping how general wildcard matches are. Introducing "else:" too would be adding a wart (redundancy) just to stop shallow-first-impression whining. "|" is also a fine way to express alternatives. "case" has its own sub-language with its own rules, and "|" is widely used to express alternatives (whether in regexps, formal grammars, ...). Spell it, e.g., "or", and then I wonder "what does short-circuiting have to do with it?". All reuse of symbols carries baggage. ".NAME" grated at first, but extends the idea that dotted names are always constant value patterns to "if and only if". So it has mnemonic value. When context alone can't distinguish whether a name is meant as (in effect) an lvalue or an rvalue, no syntax decorations can prevent coding errors. Names in destructuring constructs are overwhelmingly intended as lvalues, so adding extra cruft to say "no, I meant rvalue" is the pragmatic choice. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YDEHQRHB5S4O6KP5ROUZGOKSR3H54T34/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: How to enable tracemalloc for the test suite?
For posterity, just recording best guesses for the other mysteries left hanging: - PYTHONTRACEMALLOC didn't work for you because Victor's traceback showed that Py_FinalizeEx was executing _PyImport_Fini, one statement _after_ it disabled tracemalloc via _PyTraceMalloc_Fini. - The address passed to free() was for the struct representing the False singleton, which is static. Since it wasn't obtained from a malloc/realloc call to begin with, the free() debug code didn't find anything it expected to find. As noted earlier, if tracemalloc had still been active, it wouldn't have found that address in its database, and so would have produced no output at all. So - that was easy ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BRPNFTDWNJKVZS3KUKWYPN46KGER6LDU/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: How to enable tracemalloc for the test suite?
[Skip Montanaro ] > ... > I thought setting PYTHONTRACEMALLOC should provoke some useful output, > but I was confused into thinking I was (am?) still missing something > because it continued to produce this message: > > Enable tracemalloc to get the memory block allocation traceback Ah, I overlooked that. > which suggests to me tracemalloc still isn't enabled. That's emitted > from Modules/_tracemalloc.c and seems to be properly protected: > > if (!_Py_tracemalloc_config.tracing) { > PUTS(fd, "Enable tracemalloc to get the memory block " > "allocation traceback\n\n"); > return; > } > > so I think there is still more to do. Agreed - for whatever reason, tracemalloc is not enabled for you at the time your run dies, despite your > % PYTHONTRACEMALLOC=5 ./python ./Tools/scripts/run_tests.py -R 5:50:reflog.txt test_rattlesnake command line. Have to leave that mystery for Victor! So between the "obvious" possibilities: - something is going nuts spraying 0 all over memory - the pointer passed to free() wasn't actually obtained from a malloc/realloc function there's nothing in the output to favor one over the other. Except that dereferencing addresses near the pointer didn't blow up (null bytes were returned), so the pointer is at least in the process's address space. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/74XUF5UNOID5YZ2EOUJD5YXOSNHIBEGT/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: How to enable tracemalloc for the test suite?
[Skip Montanaro ] > I've got a memory issue in my modified Python interpreter I'm trying > to debug. Output at the end of the problematic unit test looks like this: To my eyes, you left out the most important part ;-) A traceback showing who made the fatal free() call to begin with. In debug mode, allocations are padded on both ends with various stuff: FORBIDDENBYTEs (0xfd), a count of the number of bytes originally requested, and a serial number incremented by 1 each time alloc/realloc is called. The memory you requested is entirely filled with CLEANBYTEs (0xcd). On deallocation, that stuff is checked (that's where your run is dying), and the memory is filled with DEADBYTEs (0xdd) (a weak but nevertheless often effective way to detect "double free"s). Now in your case, absolutely every byte shown is 0x00. There are no useful clues at all remaining. It's not just a short buffer overrun, _everything_ has been NULLed out, including before the requested memory. So something has gone nuts spraying 0 all over memory ... or the pointer passed to free() was nonsense to begin with. If it were a case of double free, we'd usually expect to see some DEADBYTEs in the debug output. But there weren't any. However, you did not get any output from tracemalloc, despite that it sounds like you set up the right magic to activate it. Which leans toward the "complete nonsense" hypothesis. If tracemalloc doesn't find the address in its dict of currently-allocated addresses, looks like it fails silently, simply not producing any output. Which is what you're experiencing. So where was free() called and what was passed to it? All the evidence is consistent with the idea that what was passed to free() was never obtained from a malloc or realloc call to begin with. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/A32E5ZVJ4IZBNDHMLOXH2OOAB2CKAYJN/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
[Tim] >> PyObject_RichCompareBool(x, y, op) has a (valuable!) shortcut: if x >> and y are the same object, then equality comparison returns True >> and inequality False. No attempt is made to execute __eq__ or >> __ne__ methods in those cases. >> ... >> If it's intended that Python-the-language requires this, that needs to >> be documented. [Raymond] > This has been slowly, but perhaps incompletely documented over the > years and has become baked in the some of the collections ABCs as well. > For example, Sequence.__contains__() is defined as: > > def __contains__(self, value): > for v in self: > if v is value or v == value: # note the identity test > return True > return False But it's unclear to me whether that's intended to constrain all implementations, or is just mimicking CPython's list.__contains__. That's always a problem with operational definitions. For example, does it also constrain all implementations to check in iteration order? The order can be visible, e.g., in the number of times v.__eq__ is called. > Various collections need to assume reflexivity, not just for speed, but so > that we > can reason about them and so that they can maintain internal consistency. For > example, MutableSet defines pop() as: > > def pop(self): > """Return the popped value. Raise KeyError if empty.""" > it = iter(self) > try: > value = next(it) > except StopIteration: > raise KeyError from None > self.discard(value) > return value As above, except CPython's own set implementation doesn't faithfully conform to that: >>> x = set(range(0, 10, 2)) >>> next(iter(x)) 0 >>> x.pop() # returns first in iteration order 0 >>> x.add(1) >>> next(iter(x)) 1 >>> x.pop() # ditto 1 >>> x.add(1) # but try it again! >>> next(iter(x)) 1 >>> x.pop() # oops! didn't pop the first in iteration order 2 Not that I care ;-) Just emphasizing that it's tricky to say no more (or less) than what's intended. 
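The identity shortcut in these implicit comparisons is easiest to see with a NaN, which is never equal to itself. A quick demonstration (this shows CPython's behavior, which — as discussed — may not be a language guarantee):

```python
nan = float("nan")
assert nan != nan                 # NaN compares unequal, even to itself...
assert nan in [nan]               # ...yet "in" says True: identity short-circuits __eq__
assert [nan].count(nan) == 1      # count() takes the same shortcut
assert [nan].index(nan) == 0      # so does index()
assert float("nan") not in [nan]  # a *distinct* NaN object is never found
print("identity shortcut observed")
```

The last line is the tell: the same value spelled as a different object fails the membership test, so it's object identity, not equality, doing the work.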
> That pop() logic implicitly assumes an invariant between membership and > iteration: > >assert(x in collection for x in collection) Missing an "all". > We really don't want to pop() a value *x* and then find that *x* is still > in the container. This would happen if iter() found the *x*, but discard() > couldn't find the object because the object can't or won't recognize itself: Speaking of which, why is "discard()" called instead of "remove()"? It's sending a mixed message: discard() is appropriate when you're _not_ sure the object being removed is present. > s = {float('NaN')} > s.pop() > assert not s # Do we want the language to guarantee that > # s is now empty? I think we must. I can't imagine an actual container implementation that wouldn't. But no actual container implements pop() in the odd way MutableSet.pop() is written. CPython's set.pop does nothing of the sort - doesn't even have a pointer equality test (except against C's NULL and `dummy`, used merely to find "the first (starting at the search finger)" slot actually in use). In a world where we decided that the identity shortcut is _not_ guaranteed by the language, the real consequence would be that the MutableSet.pop() implementation would need to be changed (or made NotImplemented, or documented as being specific to CPython). > The code for clear() depends on pop() working: > > def clear(self): > """This is slow (creates N new iterators!) but effective.""" > try: > while True: > self.pop() > except KeyError: > pass > > It would be unfortunate if clear() could not guarantee a post-condition that the > container is empty: That's again a consequence of how MutableSet.pop was written. No actual container has any problem implementing clear() without needing any kind of object comparison. > s = {float('NaN')} > s.clear() > assert not s # Can this be allowed to fail? No, but as above it's a very far stretch to say that clear() emptying a container _relies_ on the object identity shortcut. 
That's just a consequence of an odd specific clear() implementation, relying in turn on an odd specific pop() implementation that assumes the shortcut is in place. > The case of count() is less clear-cut, but even there > identity-implies-equality > improves our ability to reason about code: Absolutely! That "x is x implies equality" is very useful. But that's not the question ;-) > Given some list, *s*, possibly already populated, would you want the > following code to always work: > > c = s.count(x) > s.append(x) > assert s.count(x) == c + 1 # To me, this is fundamental > to what the word > "count" means. I would, yes. But it's also possible to define s.
[Python-Dev] Re: Python library for the ServiceNow REST API
[Guido] > Honestly that looked like a spammer. I approved the message, and it looked like "probably spam" to me too. But it may have just been a low-quality message, and the new moderator UI still doesn't support adding custom text to a rejection message. Under the old system, I _would_ have rejected it, but added a message explaining why. As I've said before, I give broad benefit-of-the-doubt to marginal messages, and for that reason alone some spam is certain to get approved from time to time. A vast majority of spam directed to this list does get discarded, because there's usually no plausible doubt at all. I don't believe a legitimate message has ever been rejected, and _that's_ more important to me. > Note that at the bottom they wrote > > servicenow training > > A typical strategy for such spammers is to send a message to a popular > mailing list with some text that looks vaguely related (to the spammer, > who is usually not well versed in the subject of the list) and add some > link that they want to promote (presumably a hacked or scam site). I went to the site first, and my up-to-date Malwarebytes installation (both the full commercial product and the browser extension) found nothing to complain about except that the page linked to one ad network (common as mud). It got a cleaner report from that than _most_ sites I visit from day to day. For example, Malwarebytes also points out that www.python.org links to one "ad network or tracker" (ssl.google-analytics.com). > The list moderator ought to ban the user. I will not, but fine by me if some other python-dev moderator does. That poster is still subject to moderation. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/EX2QEOZV3BRH5DH3JZOJPULSSZDQZDG3/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
[Terry Reedy ] [& skipping all the parts I agree with] > ... > Covered by "For user-defined classes which do not define __contains__() > but do define __iter__(), x in y is True if some value z, for which the > expression x is z or x == z is true, is produced while iterating over y. > " in > > https://docs.python.org/3/reference/expressions.html#membership-test-operations Just went through that with Serhiy. Very briefly, two problems: 1. Those docs are over-specified if we intend that the "x is z" is an implementation choice rather than a language requirement. 2. Repeating all that stuff all over the docs is a PITA, error-prone, and ugly, since contexts that call PyObject_RichCompareBool() under the covers are all over the place too. Document it carefully in _one_ place (the Reference Manual already has your good start), and just plant links to that in high-value locations elsewhere in the docs. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4EVEQZZPEPAGSJ25SSG2EQWQ3MUP2ZHW/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
[Tim] >> I think it needs more words, though, to flesh out what about this is >> allowed by the language (as opposed to what CPython happens to do), >> and to get closer to what Guido is trying to get at with his >> "*implicit* calls". For example, it's at work here, but there's not a >> built-in container in sight: >> >> >>> import math >> >>> def f(): >> ... while True: >> ... yield math.nan >> >>> math.nan in f() >> True [Serhiy Storchaka ] > It is documented in: > https://docs.python.org/3/reference/expressions.html#membership-test-operations > > """ > For user-defined classes which do not define :meth:`__contains__` but do > define > :meth:`__iter__`, ``x in y`` is ``True`` if some value ``z``, for which the > expression ``x is z or x == z`` is true, is produced while iterating > over ``y``. > ... But _should_ it be? There are two parts to that: 1. Consensus seems to be that whether - and where - implicit equality comparisons exploit object identity should be implementation-defined. If so, the quoted docs are over-specified, elevating a CPython implementation choice to a language requirement. 2. As I said in the first message, consequences of this choice are "all over the place". Repeating equivalents of "x is z or x == z" all over the docs is ugly and error-prone, and especially if it also needs to spell out that the "x is z" part is something CPython does but that other implementations may not do. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QLBATZTQQU5H4BGJAFRO3DNXN23NB7UU/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
[Terry Reedy ] > ... > It is, in the section on how to understand and use value comparison > *operators* ('==', etc.). > https://docs.python.org/3/reference/expressions.html#value-comparisons > > First "The default behavior for equality comparison (== and !=) is based > on the identity of the objects." > > Then in particular, "The built-in containers typically assume identical > objects are equal to themselves. That lets them bypass equality tests > for identical objects to improve performance and to maintain their > internal invariants." Cool! I didn't find that, and I assume Victor didn't either. It's good! I think it needs more words, though, to flesh out what about this is allowed by the language (as opposed to what CPython happens to do), and to get closer to what Guido is trying to get at with his "*implicit* calls". For example, it's at work here, but there's not a built-in container in sight: >>> import math >>> def f(): ... while True: ... yield math.nan >>> math.nan in f() True >> For example, in __eq__() method documentation: >> https://docs.python.org/dev/reference/datamodel.html#object.__eq > The text that follows discusses rich comparison *special methods* and > how to write them. It should refer people to the value-comparisons > section for information on specific classes, as in the second quote > above. It would not hurt if the operator section referred people back > to special method discussion. I think you should go ahead and add one > or both links. Agreed. If I want to know why my __eq__ (or __ne__) isn't being called, it's __eq__/__ne__ docs I'm going to look at first. And probably last, unless I'm nudged in the right direction. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/T4BAR4DVTMTR3GOJ2Y2AL2ZBCL6VSPUK/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
[Inada Naoki ] > FWIW, (list|tuple).__eq__ and (list|tuple).__contains__ uses it too. > It is very important to compare recursive sequences. > > >>> x = [] > >>> x.append(x) > >>> y = [x] > >>> z = [x] > >>> y == z > True That's a visible consequence, but I'm afraid this too must be considered an implementation-defined quirk. If doing a fine job of comparing recursive containers was the actual goal, we would have needed to do a much fancier thing, akin to testing graph isomorphism: >>> a = [] >>> b = [] >>> for x in a, b: ... x.append(x) >>> a [[...]] >>> b [[...]] >>> a == a True >>> b == b True >>> a == b Traceback (most recent call last): ... RecursionError: maximum recursion depth exceeded in comparison Instead the happy behavior in your example (and in my first two examples) just follows as a no-effort consequence of the special-casing for identity equality in PyObject_RichCompareBool(). (Note that the _top_ level comparisons in my "a == a" and "b == b" examples do _not_ exploit that the comparands are the same object - it's list_richcompare() calling PyObject_RichCompareBool() where the latter tells the former that a[0] and a[0] are equal.) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/L4RPMHX7USTWLACBC5URKB3WXNBH2DKA/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Are PyObject_RichCompareBool shortcuts part of Python or just CPython quirks?
PyObject_RichCompareBool(x, y, op) has a (valuable!) shortcut: if x and y are the same object, then equality comparison returns True and inequality False. No attempt is made to execute __eq__ or __ne__ methods in those cases. This has visible consequences all over the place, but they don't appear to be documented. For example, >>> import math >>> ([math.nan] * 5).count(math.nan) 5 despite that `math.nan == math.nan` is False. It's usually clear which methods will be called, and when, but not really here. Any _context_ that calls PyObject_RichCompareBool() under the covers, for an equality or inequality test, may or may not invoke __eq__ or __ne__, depending on whether the comparands are the same object. Also any context that inlines these special cases to avoid the overhead of calling PyObject_RichCompareBool() at all. If it's intended that Python-the-language requires this, that needs to be documented. Or if it's implementation-defined, then _that_ needs to be documented. Which isn't straightforward in either case, in part because PyObject_RichCompareBool isn't a language-level concept. This came up recently when someone _noticed_ the list.count(NaN) behavior, and Victor made a PR to document it: https://github.com/python/cpython/pull/18130 I'm pushing back, because documenting it _only_ for .count() makes .count() seem unique in a way it isn't, and doesn't resolve the fundamental issue: is this language behavior, or implementation behavior? Which I don't want to argue about. But you certainly should ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3ZAMS473HGHSI64XB3UV4XBICTG2DKVF/ Code of Conduct: http://python.org/psf/codeofconduct/
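A few more visible consequences of the same shortcut, beyond `.count()` - all CPython 3.x behavior; whether the language guarantees any of it is exactly the open question here:

```python
import math

nan = math.nan
print(nan == nan)             # False, per IEEE-754 rules
print(nan in [nan])           # True - membership shortcuts on identity
print([nan].index(nan))       # 0 - index "finds" it too
print(nan in {nan})           # True - sets take the same shortcut
print(float("nan") in [nan])  # False - a distinct NaN object really compares
```

The last line shows the flip side: with two different NaN objects, `__eq__` does run, returns False, and the membership test fails.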
[Python-Dev] Re: Documenting Python's float.__str__()
[Serhiy Storchaka]
> This is not the only difference between '.17g' and repr().
>
> >>> '%.17g' % 1.23456789
> '1.23456788'
> >>> format(1.23456789, '.17g')
> '1.23456788'
> >>> repr(1.23456789)
> '1.23456789'

More amazingly ;-), repr() isn't even always the same as a %g format specifying exactly the same number of digits as repr() produces. That's because repr(x) currently returns the shortest string such that eval(repr(x)) == x, but %g rounds correctly to the given number of digits. Not always the same thing!

>>> x = 2.0 ** 89
>>> print(repr(x))
6.189700196426902e+26
>>> print("%.16g" % x)  # repr produced 16 digits
6.189700196426901e+26

The repr() output is NOT correctly rounded. To see which one is correctly rounded, here's an easy way:

>>> import decimal
>>> decimal.Decimal(x)
Decimal('618970019642690137449562112')

The "37449562112" is rounded off, and is less than half a unit in the last place, so correct rounding truncates the last digit to 1. But there is no string with 16 digits other than the incorrectly rounded one repr() returns that gives x back. In particular, the correctly rounded 16 digit string does not:

>>> 6.189700196426901e+26  # 16-digit correctly rounded fails
6.189700196426901e+26
>>> x == _
False

To my mind it's idiotic(*) that "shortest string" requires incorrect rounding in some cases. In Python's history, eval(repr(x)) == x is something that was always intended, so long as the writing and reading was done by the same Python instance on the same machine. Maybe it's time to document that ;-) But CPython goes far beyond that now, also supplying correct rounding, _except_ for repr's output, where - for reasons already illustrated - "correct rounding" and "shortest" can't always both be satisfied.

(*) What wouldn't be idiotic? For repr(x) to return the shortest _correctly rounded_ string such that eval(repr(x)) == x.
In the example, that would require repr(x) to produce a 17-digit output (and 17 is the most that's ever needed for a Python float on boxes with IEEE doubles). But "shortest string" was taken ultra literally by the people who first worked out routines capable of doing that, so has become a de facto standard now. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/F3BOGGTGJPZS3RR7FKG7YE6GYADHYI76/ Code of Conduct: http://python.org/psf/codeofconduct/
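The claims in that exchange are easy to verify by hand; this sketch just re-runs the thread's own example (IEEE-754 doubles assumed):

```python
import decimal

x = 2.0 ** 89
# the exact value stored in the double
assert decimal.Decimal(x) == decimal.Decimal("618970019642690137449562112")
# repr is shortest (16 digits here), and round-trips...
assert repr(x) == "6.189700196426902e+26"
assert eval(repr(x)) == x
# ...but the correctly rounded 16-digit string does NOT round-trip
assert "%.16g" % x == "6.189700196426901e+26"
assert float("%.16g" % x) != x
print("all claims check out")
```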
[Python-Dev] Re: Object deallocation during the finalization of Python program
[Pau Freixes ] > Recently I've been facing a really weird bug where a Python program > was randomly segfaulting during the finalization, the program was > using some C extensions via Cython. There's nothing general that can be said that would help. These things require excruciating details to resolve. It's generally the case that these things are caused by code missing some aspect of correct use of Python's C API. Which can be subtle! It's very rare that such misuse is found in the core, but not so rare in extensions. Here's the most recent. Be aware that it's a very long report, and - indeed - is stuffed to the gills with excruciating details ;-) https://bugs.python.org/issue38006 In that case a popular C extension wasn't really "playing by the rules", but we changed CPython's cyclic garbage collector to prevent it from segfaulting anyway. So, one thing you didn't tell us: which version of Python were you using? If not the most recent (3.8.1), try again with that (which contains the patches discussed in the issue report above). > Debugging the issue I realized that during the deallocation of one of > the Python objects the deallocation function was trying to release a > pointer that was surprisingly assigned to NULL. The pointer was at > the same time held by another Python object that was an attribute of > the Python object that had the deallocation function, something like > this: > > ... > > So my question would be, could CPython deallocate the objects during > the finalization step without considering the dependencies between > objects? No, it could not. BUT. But but but. If C code isn't playing by all the rules, it's possible that CPython doesn't _know_ what the dependencies actually are. CPython can only know about pointers that the user's C API calls tell CPython about. In the report above, the C extension didn't tell CPython about some pointers in cycles at all. 
Curiously, for most kinds of garbage collectors, failing to inform the collector about pointers is massively & widely & very reliably catastrophic. If it doesn't see a pointer, it can conclude that a live object is actually trash. But CPython's cyclic collector follows pointers to prove objects _are_ trash, not to prove that they're alive. So when CPython isn't told about a pointer, it usually "fails soft", believing a trash object is actually alive instead. That can cause a memory leak, but usually not a crash.

But not always. In the presence of weakrefs too, believing a trash object is actually alive can cause some kinds of finalization to be done "too late", which can cause a crash when refcounting discovers that the object actually is trash. No, you're not going to get a simple example of that ;-) Dig through the issue report above.

> If this is not the right list to make this kind of questions, just let
> me know what would be the best place for making this kind of questions

Open an issue report instead? Discussions won't solve it, alas.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WTH2LH3S52CN435B6I273BJ3QJW5GTES/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Larry]
> It's a lightweight abstract dependency graph. Its nodes are opaque,
> only required to be hashable. And it doesn't require that you give it
> all the nodes in strict dependency order.
>
> When you add a node, you can also optionally specify dependencies,
> and those dependencies aren't required to be nodes that the graph
> has previously seen. So you can add a node A that depends on B and C,
> without showing it B or C first. When that happens it creates
> placeholder nodes for B and C, and naturally these nodes have no
> dependencies. You can then later inform the graph that node B has
> dependencies on X Y and Z.
>
> (My specific use case is a package manager. Packages are identified
> with unique strings. So the nodes I give the graph are simply those
> package names. It's pretty common for me to get a package with
> dependencies on packages I haven't seen yet. Designing the graph to
> create placeholders let me skip making two passes over the package
> list, pre-creating the package objects, etc etc. This made the code
> simpler and presumably faster.)
>
> The graph API maintains an externally-visible "ready" iterable of
> nodes. And since B can go from ready to not-ready, it can get added
> to "ready" and subsequently removed.
>
> Also, when a node is satisfied, you simply remove it from the graph.
> If A depends on B and C, and they all represent jobs, and B and C
> have no dependencies, they'll be in the "ready" iterable. You iterate
> over "ready", and execute those jobs, and assuming they are
> successful you then remove them from the graph. When A's
> dependencies are all satisfied, it'll get added to the "ready" iterable.
> And naturally when B and C were removed from the graph, they were
> removed from the "ready" iterable too.

OK! You're doing a topological sort. 
There are natural & simple ways to do that right now that scale efficiently to very large graphs (time & space linear in the sum of the number of nodes and the number of dependencies). Curiously, the difficulties are mostly driven by the quality of _error_ messages you want (in case of, e.g., cycles in the dependency graph).

Lots of those implementations became deterministic "by magic" when ordered dicts were introduced. This is why: a bare bones implementation needs to associate two pieces of info with each node: a list of its immediate successors, and an integer count of the number of its immediate predecessors. A dict is the natural way to implement that association. When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0. Iterating over the association dict once is the natural way to find them, and that order is defined now.

- Then, as each emitted node N is marked done:

      for child in N.successors:
          assert child.predcount > 0
          child.predcount -= 1
          if child.predcount == 0:
              emit(child)

That's about all there is to it. But there's no end to the cruft that _can_ be added to, e.g., verify that a node is marked done no more than once, or to compute & display strongly connected components if not all nodes are emitted, etc.

Ordered sets could improve this "natural" implementation if the successor lists were successor sets instead, simply because then - like lists - their iteration order would be defined, which can in turn affect the order in which nodes are emitted in the loop just above. But lists are adequate to the task, are cheaper in both time and space, and can't _not_ be ordered ;-)

> Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and
> c) support fast removal. I add and remove nodes from that set a lot, which I
> realized would perturb the iteration order from run to run, etc etc etc, and here
> we are. 
The sketch above (which is a bog-common way to implement topsorts) doesn't build a data structure recording the final order, and, in fact, never deletes anything. You might protest that the initial iteration step (scanning the association dict to find nodes with no predecessors) is an expense you skip because nodes with predecessors are deleted from your "ready" structure all along. But nodes with predecessors are _touched_ either way, and merely skipping over them is bound to be cheaper than deleting them. After that initial step, no search of any kind is needed again. > I grant you maybe it's a weird approach. But after two false starts, where I > was starting to boil the oceans with ABCs and "node is satisfied" APis and > separate iterator objects, I had a re-think and hit on this super lightweight > design. I'm actually quite pleased with it--it's short, it has a minimal API, > it's readable, it was easy to get right, and it solves my problem. Whereas I would have started with a topsort and finished it while I was sleeping ;-) Seriously, I've written such a thing many times, but never reuse the code because it's so easy to redo
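A runnable version of that bare-bones topsort sketch might look like the following. The `deps` mapping, `toposort` name, and cycle error are my own choices for illustration, not Larry's API:

```python
def toposort(deps):
    """deps maps each node to an iterable of the nodes it depends on."""
    successors = {}  # node -> list of nodes that depend on it
    predcount = {}   # node -> count of not-yet-done predecessors
    for node, preds in deps.items():
        predcount.setdefault(node, 0)
        for pred in preds:
            predcount.setdefault(pred, 0)  # placeholder node, if unseen
            successors.setdefault(pred, []).append(node)
            predcount[node] += 1
    # first emit all nodes with no predecessors, in dict (insertion) order
    order = [n for n, c in predcount.items() if c == 0]
    for n in order:  # `order` grows as emitted nodes are marked done
        for child in successors.get(n, ()):
            predcount[child] -= 1
            if predcount[child] == 0:
                order.append(child)
    if len(order) < len(predcount):
        raise ValueError("cycle in dependency graph")
    return order

# A depends on B and C; C depends on B - so B, then C, then A
print(toposort({"A": ["B", "C"], "B": [], "C": ["B"]}))  # ['B', 'C', 'A']
```

Time and space are linear in nodes plus dependencies, and - because dicts preserve insertion order - the emitted order is deterministic, as described above.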
[Python-Dev] Re: Should set objects maintain insertion order too?
[Larry]
> Here is the original description of my problem, from the original email in
> this thread. I considered this an adequate explanation of my problem
> at the time.

>> I do have a use case for this. In one project I maintain a "ready" list of
>> jobs; I need to iterate over it, but I also want fast lookup because I
>> sometimes remove jobs when they're subsequently marked "not ready".

Which is a description not of "the problem", but of the operations you want to perform. It's the first time in my life I've heard of a "ready list of jobs" where the programmer sometimes decides later "oh, no - this one and that one aren't ready after all" ;-) Not that it matters much - you want the operations you want. The lack of "oh - of course - that's a problem we all face at times" motivation, though, works against it being _compelling_ on its own.

> ...
> In subsequent emails I've clarified that my workloads are small enough--and
> computers are fast enough--that almost any data structure would work fine
> for me here, e.g. a list.
>
> I don't think my needs should drive the decision making process regardless.
> I only described my problem to get the conversation rolling.

Which it did :-) Others went on to, explicitly or implicitly, suggest that an ordered set must also, e.g., support the entire MutableSequence interface, and even define what happens if you mutate the ordered set _while_ iterating over it (lists define the results in such cases, but ordered dicts raise an exception).

> I opine that anybody who iterates over set objects and has bugs in their
> code would appreciate set objects maintaining insertion order. I suspect
> it would make their debugging easier, because given identical inputs their
> set iteration would behave identically, thus making their bugs that much
> more stable. That's as much use case as I have for the feature. 
That's much stronger to me than some weird FIFO-only queue supporting fast deletion of interior elements (which - sorry - I've never had a use for). And fine by me if such a thing is added.

All else being equal, I'd prefer that the builtin set type be ordered, simply because it's less surprising, and dicts are ordered now (although, as Inada Naoki said, their current implementation isn't suitable for a FIFO queue, due primarily to O(N) time needed to find "the first" key+value record).

I believe we agree that adding _some_ flavor of OrderedSet to `collections` would be a fine start. I'm just not cool with replacing the builtin set before there's an implementation fast and memory-efficient enough that _current_ heavy set users won't feel betrayed by major slowdowns or memory bloat when the switch is made.

Toward that end, my opinion - which I won't defend - is that OrderedSet should not promise better than O(N)-time indexing (if it supports indexing at all), and should - like current sets and dicts - try to raise an exception if it can detect that the container has been mutated during iteration.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AH4LQB3OCMCE22FEWUALRND5P6AHUWE6/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Nick Coghlan ]
> I took Larry's request a slightly different way: he has a use case where
> he wants order preservation (so built in sets aren't good), but combined
> with low cost duplicate identification and elimination and removal of
> arbitrary elements (so lists and collections.deque aren't good). Organising
> a work queue that way seems common enough ...

Is it? I confess I haven't thought of a plausible use case. Larry didn't really explain his problem, just suggested that an ordered set would be "a solution" to it.

The problem: whether there are duplicates "in the queue" is a question about an implementation detail, hard for me to translate to a question about the _task_ to be solved. For example, suppose "the task" is to make a deep copy of a web site. A "job" consists of following a URL, sucking down the page text, and adding new jobs for contained URLs on the page. We probably don't want to suck down the page text multiple times for a given URL, but checking whether a URL is currently already in the job queue is asking a question about an implementation detail that misses the point: we want to know whether that URL has already been chased, period. Whether the URL is currently in the queue is irrelevant to that. The only logical relationship is that a job that has finished _was_ in the queue at some time before the job finished.

So, ya, I've seen and implemented lots of work queues along these lines - but an OrderedSet would be an "attractive nuisance" (offhandedly appearing to solve a problem it doesn't actually address):

    jobs = some_kind_of_queue()
    finished_jobs = set()
    ...
    while jobs:
        job = jobs.get()
        if job in finished_jobs:
            continue
        try:
            work on the job, possibly adding (sub)jobs to the `jobs` queue
        except TransientExceptions:
            jobs.put(job)  # try again later
        else:
            finished_jobs.add(job)

Clear? 
There is a queue here, and a set, but they can't be combined in a useful way: the set contents have nothing to do with what's currently in the queue - the set consists of jobs that have been successfully completed; the queue doesn't care whether it contains duplicates, and merely skips over jobs that have been completed by the time the queue spits them out (which strikes me as more robust design than duplicating logic everywhere a job is added to a queue to try to prevent adding already-done jobs to begin with). Similarly, in other places I've used a "finished" flag on the object instead of a set, or a dict mapping jobs to info about job status and results. But in all these cases, the vital info associated with a job really has little to do with whether the job is currently sitting on a queue. If "is it currently on the queue?" _is_ a common question to ask, perhaps someone could give a semi-realistic example? ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/2XDEVSO5S5ZLVZTLX43UHBOJWMXYJIIB/ Code of Conduct: http://python.org/psf/codeofconduct/
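Stripped of the transient-error handling, the pattern described above can be made concrete in a few lines. The names `process_all` and `get_links` are mine; `get_links` stands in for whatever "work on the job" discovers:

```python
from collections import deque

def process_all(start, get_links):
    jobs = deque([start])  # plain FIFO; duplicates in it are harmless
    finished = set()       # the set that actually matters
    while jobs:
        job = jobs.popleft()
        if job in finished:
            continue  # already chased - skip, rather than pre-filtering
        jobs.extend(get_links(job))  # "work on the job", may add sub-jobs
        finished.add(job)
    return finished

# toy "web site": page -> links it contains (with cycles and duplicates)
links = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
print(sorted(process_all("a", links.__getitem__)))  # ['a', 'b', 'c']
```

Note that the queue and the set never need to be combined: the queue tolerates duplicates, and the `finished` check at the top skips any job that was completed by the time the queue spit it out.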
[Python-Dev] Re: Should set objects maintain insertion order too?
[Tim]
>> - I don't have a theory for why dict build time is _so_ much higher
>> than dict lookup time for the nasty keys.

To be clearer, in context this was meant to be _compared to_ the situation for sets. These were the numbers:

11184810 nasty keys
dict build   23.32
dict lookup  13.26
set build     9.79
set lookup    8.46

Build time is higher than lookup time for both, which isn't surprising. But it's _much_ higher (whether in absolute or relative terms) for dicts than for sets.

[Serhiy Storchaka ]
> 1/2 of items were copied to a new hashtable when the dict grew up. 1/4
> of items were copied twice. 1/8 of items were copied three times, etc.
> In average every item is copied 1 time, so the build time is twice the
> lookup time when the cost of lookup is dominated.

I agree the average total number of table insertions per element approaches 2 either way (counting insertions due to internal table growth as well as insertions visible from Python code). That's why it's expected that build time exceeds lookup time (the latter does exactly one lookup per element).

> But the lookup benchmarking includes the overhead of iterating Python
> loop, which is more significant for "good" keys, thus the lookup time is
> larger than the build time.

Iteration time is too minor to care much about. Even in the worst case (sets with the nasty keys), the Python:

    for k in d:
        pass

takes under a second here. Take a full second off the "set lookup" time, and dict build time still exceeds dict lookup time (whether in absolute or relative terms) much more so than for sets.

But I'm not going to pursue it. The high-order bit was just to demonstrate that the set object's more complicated (than dicts) probe sequence does buy highly significant speedups - in at least one highly contrived case. Since the probe sequence changes were aimed at reducing cache misses, a full account requires comparing cache misses between the dict and set implementations. 
That's a mess, and slogging through it appears unlikely to result in insight. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/IYPHUKW7DGYNAFAC4AOCDYYUWANHN5YE/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Tim]
> I know the theoretical number of probes for dicts, but not for sets
> anymore. The latter use a mix of probe strategies now, "randomish"
> jumps (same as for dicts) but also purely linear ("up by 1") probing
> to try to exploit L1 cache.
>
> It's not _apparent_ to me that the mix actually helps ;-) I just
> trust that Raymond timed stuff until he was sure it does.

To illustrate, I contrived a case where the set optimizations clearly pay off. Output (3.8.1, Win 10 Pro, quad core box with plenty of RAM):

11184810 nice keys
dict build    0.54
dict iter     0.07
dict lookup   0.81
set build     0.54
set iter      0.08
set lookup    0.82

11184810 nasty keys
dict build   23.32
dict iter     0.09
dict lookup  13.26
set build     9.79
set iter      0.28
set lookup    8.46

I didn't make heroic efforts to keep the machine quiet, and there's substantial variation across runs. For example, there's no real difference between "0.07" and "0.08". Some things to note:

- The "nice" keys act much the same regardless of whether dict or set. Dicts have an advantage for iteration "in theory" in the absence of deletions because there are no "holes" in the area storing the hash+key+value records. But for these specific keys, the same is true of sets: the initial hash probes are at indices 0, 1, 2, ..., len(the_set)-1, and there are no holes in its hash+key records either (all the empty slots are together "at the right end").

- Sets get a lot of value out of their fancier probe sequence for the nasty keys. There are lots of collisions, and sets can much more often resolve them from L1 cache. Better than a factor of 2 for build time is nothing to sneeze at.

- Iteration for sets is substantially slower in this case, for two reasons: (1) sets _do_ have holes for this collection of keys; and, (2) while I don't think it's documented, the maximum load factor for sets is lower than for dicts, and 11184810 was picked to give the maximum load factor for a dict with 2**24 slots. 
The set in this case has twice as many slots as the dict, and all the extras are empty slots that iteration has to skip over.

- I don't have a theory for why dict build time is _so_ much higher than dict lookup time for the nasty keys.

- If you don't know a lot about how these things are implemented, just about everything is baffling ;-) Integer keys are important in practice, but things about how modern dicts are implemented were _driven_ by opportunities to exploit properties of the mostly-trivial int hash function. For "general" timing, it's better to use string keys, whose hash codes are "pretty randomish" (but can vary across runs). When using int keys, it's really easy to _end up_ measuring happy behaviors _specific_ to int keys.

Code:

    from time import perf_counter as now
    import sys
    from collections import deque

    def disp(text, f):
        print(f"{text:20s} {f:5.2f}")

    M = sys.hash_info.modulus
    targetlen = 2**24 * 2 // 3  # highest load factor for dict
    hc = 1
    badkeys = []
    while len(badkeys) < targetlen:
        # add no more than 100 ints with hash code `hc`
        toadd = min(100, targetlen - len(badkeys))
        badkeys.extend(hc + i * M for i in range(toadd))
        hc += 713  # more than less arbitrary
    goodkeys = list(range(targetlen))

    for tag, keys in ("nice", goodkeys), ("nasty", badkeys):
        print()
        print(len(keys), tag, "keys")
        for thetype, builder in ("dict", dict.fromkeys), ("set", set):
            s = now()
            d = builder(keys)
            f = now()
            disp(thetype + " build", f-s)

            s = now()
            deque(d, maxlen=0)
            f = now()
            disp(thetype + " iter", f-s)

            s = now()
            for k in keys:
                assert k in d
            f = now()
            disp(thetype + " lookup", f-s)
            del d
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/S3UB5TCSUBLYSESG7WO2HNCFBZB5N4BG/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
>> Also, I believe that max "reasonable" integer range of no collision
>> is (-2305843009213693951, 2305843009213693951), ...

> Any range that does _not_ contain both -2 and -1 (-1 is an annoying
> special case, with hash(-1) == hash(-2) == -2), and spans no more than
> sys.hash_info.modulus integers. Apart from that, the sign and
> magnitude of the start of the range don't matter; e.g.,
>
> >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100)))
> 100
> >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100)))
> 100

Sorry again! That only shows that the hash codes are unique. Dicts and sets use only the last N bits to determine the start of the probe sequence, for some value of N >= 3 (depending on the table size). So for a table of size a million, the last 20 bits (10**6 ~= 2**20) are interesting. But same thing:

>>> MASK = (1 << 20) - 1
>>> len(set(hash(i) & MASK for i in range(10**5000, 10**5000 + 100)))
100
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 100)))
100
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BXHOS73YMJDSVZUM74VYXXYN5WW63BZ4/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Kyle] > ... > For some reason, I had assumed in the back of my head (without > giving it much thought) that the average collision rate would be the > same for set items and dict keys. Thanks for the useful information. I know the theoretical number of probes for dicts, but not for sets anymore. The latter use a mix of probe strategies now, "randomish" jumps (same as for dicts) but also purely linear ("up by 1") probing to try to exploit L1 cache. It's not _apparent_ to me that the mix actually helps ;-) I just trust that Raymond timed stuff until he was sure it does. > ... > Ah, I forgot to consider how the hash function actually works for continuous > integers. A better comparison to demonstrate the collision differences would > likely use random strings. And, at an extreme, a class with a __hash__ that always returns the same integer. > Also, I believe that max "reasonable" integer range of no collision > is (-2305843009213693951, 2305843009213693951), ... Any range that does _not_ contain both -2 and -1 (-1 is an annoying special case, with hash(-1) == hash(-2) == -2), and spans no more than sys.hash_info.modulus integers. Apart from that, the sign and magnitude of the start of the range don't matter; e.g., >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 100))) 100 >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 100))) 100 ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FXTRZLOHSWYV5KEGJ4DYNS7HOMUOPVE5/ Code of Conduct: http://python.org/psf/codeofconduct/
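For the extreme case mentioned above - a class whose __hash__ always returns the same integer - every element lands in the same initial slot, so each insert and lookup has to walk the full collision chain. A quick sketch (the `AlwaysCollide` class is hypothetical, for illustration only):

```python
class AlwaysCollide:
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 42  # every instance gets the same hash code
    def __eq__(self, other):
        return isinstance(other, AlwaysCollide) and self.v == other.v

# builds in roughly quadratic time: each insert probes past all prior entries
s = {AlwaysCollide(i) for i in range(1000)}
print(len(s))                   # 1000 - correctness survives, speed doesn't
print(AlwaysCollide(500) in s)  # True, found after walking the chain
```

This is the degenerate case the probe-sequence machinery exists to make tolerable; with it, hash tables degrade gracefully to linear-search behavior instead of breaking.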
[Python-Dev] Re: Should set objects maintain insertion order too?
Sorry! A previous attempt to reply got sent before I typed anything :-( Very briefly: > >>> timeit.timeit("set(i for i in range(1000))", number=100_000) [and other examples using a range of integers] The collision resolution strategy for sets evolved to be fancier than for dicts, to reduce cache misses. This is important for sets because the _only_ interesting thing about an element wrt a set is whether or not the set contains it. Lookup speed is everything for sets. If you use a contiguous range of "reasonable" integers for keys, the integer hash function is perfect: there's never a collision. So any such test misses all the work Raymond did to speed set lookups. String keys have sufficiently "random" hashes to reliably create collisions, though. Cache misses can, of course, have massive effects on timing. > Add (much faster for dicts): > >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000) > 13.330938750001224 > >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000) > 5.788865385999088 In the former case you're primarily measuring the time to look up the "add" method of sets (that's more expensive than adding 0 to the set). A better comparison would, e.g., move `add = s.add` to a setup line, and use plain "add(0)" in the loop. That's it! ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5UKYDB36WDRRIYOLRD52QS4I43SCAAJ6/ Code of Conduct: http://python.org/psf/codeofconduct/
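To show what hoisting the bound method into a setup line looks like in practice, here's a sketch of the comparison Tim suggests (timings vary by machine, so no numbers are asserted):

```python
import timeit

N = 200_000

# naive: every iteration pays for the attribute lookup of s.add
naive = timeit.timeit("s.add(0)", setup="s = set()", number=N)

# hoisted: look the method up once, then measure only the calls
hoisted = timeit.timeit("add(0)", setup="s = set(); add = s.add", number=N)

print(f"naive   {naive:.4f}s")
print(f"hoisted {hoisted:.4f}s")  # typically noticeably faster
```

The difference between the two is roughly the cost of the `s.add` attribute lookup, which the quoted benchmark was measuring alongside the actual insertion.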
[Python-Dev] Re: Should set objects maintain insertion order too?
. >>> >>> Also, this is mostly speculation since I haven't ran any benchmarks for an >>> OrderedSet implementation, but I strongly suspect that OrderedSet will end >>> up having better average performance for add, pop, and update than trying >>> to use a dictionary with None values (assuming it's implemented in C or at >>> least with a C accelerator). >>> >>> Not to mention allowing the ability to update (for both addition and >>> removal) based on intersections and unions across ordered sets, which >>> currently isn't possible to do with dictionaries (at least not directly >>> with |=, &=, -=. and ^=). I'm not sure how much of a practical use case >>> this would have for ordered sets specifically, but it's a nice bonus. >>> >>> On Sat, Dec 21, 2019 at 7:59 AM Larry Hastings wrote: >>>> >>>> >>>> (mixing together messages in the thread, sorry threaded-view readers) >>>> >>>> On 12/19/19 3:15 PM, Tim Peters wrote: >>>> >>>> [Nick] >>>> >>>> I took Larry's request a slightly different way: >>>> >>>> [...] >>>> >>>> Only Larry can answer whether that would meet his perceived need. My >>>> _guess_ is that he wouldn't know OrderedSet existed, and, even if he >>>> did, he'd use a dict with None values anyway (because it's less hassle >>>> and does everything he wanted). >>>> >>>> >>>> At last, something I can speak knowledgeably about: Larry's use case! >>>> Regarding Larry, I'd say >>>> >>>> his use case was small enough that almost anything maintaining order would >>>> have worked, even a list, >>>> he often does have a pretty good idea what goodies are salted away in the >>>> Python standard library, and >>>> a hypothetical collections.OrderedSet would probably work just fine. And >>>> he'd probably use it too, as that makes the code clearer / easier to read. >>>> If you read code using an OrderedSet, you know what the intent was and >>>> what the semantics are. 
If you see code using a dict but setting the >>>> values to None, you think "that's crazy" and now you have a puzzle to >>>> solve. Let's hope those comments are accurate! >>>> >>>> >>>> Also, speaking personally, at least once (maybe twice?) in this thread >>>> folks have asked "would YOU, Mr Larry, really want ordered sets if it >>>> meant sets were slightly slower?" >>>> >>>> The first thing I'd say is, I'm not sure why people should care about >>>> what's best for me. That's sweet of you! But you really shouldn't. >>>> >>>> The second thing is, my needs are modest, so the speed hit we're talking >>>> about would likely be statistically insignificant, for me. >>>> >>>> And the third thing is, I don't really put the set() API through much of a >>>> workout anyway. I build sets, I add and remove items, I iterate over >>>> them, I do the occasional union, and only very rarely do I use anything >>>> else. Even then, when I write code where I reach for a fancy operation >>>> like intersection or symmetric_difference--what tends to happen is, I have >>>> a rethink and a refactor, and after I simplify my code I don't need the >>>> fancy operations anymore. I can't remember the last time I used one of >>>> these fancy operations where the code really stuck around for a long time. >>>> >>>> So, personally? Sure, I'd take that tradeoff. As already established, I >>>> like that dict objects maintain their order, and I think it'd be swell if >>>> sets maintained their order too. I suspect the performance hit wouldn't >>>> affect me in any meaningful way, and I could benefit from the >>>> order-maintaining semantics. I bet it'd have other benefits too, like >>>> making regression tests more stable. And my (admittedly uninformed!) >>>> assumption is, the loss of performance would mostly be in these >>>> sophisticated operations that I never seem to use anyway. >>>> >>>> But I don't mistake my needs for the needs of the Python community at >>>> large. I'd be mighty interested
[Python-Dev] Re: Should set objects maintain insertion order too?
[Inada Naoki ]
> I just meant the performance of the next(iter(D)) is the most critical part
> when you implement orderdset on top of the current dict and use it as a queue.

Which is a good point. I added a lot more, though, because Wes didn't even mention queues in his question:

[Wes Turner ]
> How slow and space-inefficient would it be to just implement the set
> methods on top of dict?

I can't guess what he was _really_ asking about, so now he's got lots of answers ;-)

> This code should be O(N), but it's O(N^2) if q is implemented on top
> of the dict.
>
>     while q:
>         item = q.popleft()

Sorry for the confusion. To flesh this out, the FIFO queue entries are all together in a contiguous vector, with no "holes" in the middle. Pushing a new item appends "to the right" (higher index in the vector), while popping an item leaves "a hole" at the left. But there are no holes "in the middle" in this case. So the first real entry is at a higher index with every pop, but iteration starts at index 0 every time. The more pops there are, the more holes iteration has to skip over, one at a time, to find the first real entry remaining.

Until pushing a new item would fall off the right end of the vector. Then the table is rebuilt from scratch, moving all the real entries to form a contiguous block starting at index 0 again.

>> And the current set implementation (like the older dict
>> implementation) never needs to "rebuild the table from scratch" unless
>> the cardinality of the set keeps growing.

> It is a bit misleading. If "the cardinality of the set" means len(S),

Yes.

> set requires the rebuild in low frequency if the its items are random.
>
> Anyway, it is a smaller problem than next(iter(D)) because of the
> amortized cost is O(1).

For the specific case of FIFO queues, sure. In general, though, "O(1), period" is more desirable than "amortized O(1)". This is especially true when sets get very large.

> Current dict is not optimized for queue usage.
Nor should it be :-) That's what collections.deque is for: pushes and pops, at both ends, always O(1), period. No holes, no internal rearrangements, and O(1) "wasted" space. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5VDU52JX2GFM6546W6FCFKXIODDAQKP4/ Code of Conduct: http://python.org/psf/codeofconduct/
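[Editor's sketch] The two approaches being contrasted, side by side (the dict-based queue is a stand-in for the hypothetical "ordered set on top of a dict"; `dict_popleft` is an illustrative helper name, not a real API):

```python
from collections import deque

def dict_popleft(d):
    """FIFO pop emulated on an insertion-ordered dict.

    next(iter(d)) must skip over the "holes" left by earlier
    deletions to find the first live entry, which is what makes a
    drain loop O(N**2) overall in the scheme described above.
    """
    k = next(iter(d))
    del d[k]
    return k

d = dict.fromkeys(range(5))
order = [dict_popleft(d) for _ in range(5)]
assert order == [0, 1, 2, 3, 4]   # FIFO order is preserved, but slowly

# collections.deque gives O(1)-period pushes and pops at both ends:
q = deque(range(5))
assert [q.popleft() for _ in range(5)] == [0, 1, 2, 3, 4]
```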
[Python-Dev] Re: Should set objects maintain insertion order too?
[Wes Turner ]
>> How slow and space-inefficient would it be to just implement the set methods
>> on top of dict?

[Inada Naoki ]
> Speed: Dict doesn't cache the position of the first item. Calling
> next(iter(D)) repeatedly is O(N) in worst case.
> ...

See also Raymond's (only) message in this thread. We would also lose low-level speed optimizations specific to the current set implementation. And we would need to define what "insertion order" _means_ for operators (union, intersection, symmetric difference) that build a new set out of other sets without a user-level "insert" in sight. However that's defined, it may constrain possible implementations in ways that are inherently slower (for example, may require the implementation to iterate over the larger operand where they currently iterate over the smaller).

And the current set implementation (like the older dict implementation) never needs to "rebuild the table from scratch" unless the cardinality of the set keeps growing. As Raymond telegraphically hinted, the current dict implementation can, at times, in the presence of deletions, require rebuilding the table from scratch even if the dict's maximum size remains bounded.

That last can't be seen directly from Python code (rebuilding the table is invisible at the Python level). Here's a short example:

    d = dict.fromkeys(range(3))
    while True:
        del d[0]
        d[0] = None

Run that with a breakpoint set on dictobject.c's `dictresize()` function. You'll see that it rebuilds the table from scratch every 3rd time through the loop. In effect, for the purpose of deciding whether it needs to rebuild, the current dict implementation sees no difference between adding a new element and deleting an existing element. Deleting leaves behind "holes" that periodically have to be squashed out of existence so that insertion order can be maintained in a dead simple way upon future additions.
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BVCDKT5ULY324RZATMCEFGAMYOBJFHIZ/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Nick]
> I must admit that I was assuming without stating that a full OrderedSet
> implementation would support the MutableSequence interface.

Efficient access via index position too would be an enormous new requirement. My bet: basic operations would need to change from O(1) to O(log(N)).

BTW, in previous msgs there are links to various implementations calling themselves "ordered sets". One of them supplies O(1) indexing, but at the expense of making deletion O(N) (!):

https://pypi.org/project/ordered-set/

If efficient indexing is really wanted, then the original "use case" Larry gave was definitely obscuring an XY problem ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/IBRSGUTHIMOZ6JGIYJBQJFXEANFZI4V5/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[David Mertz ] > It's not obvious to me that insertion order is even the most obvious or > most commonly relevant sort order. I'm sure it is for Larry's program, but > often a work queue might want some other order. Very often queues > might instead, for example, have a priority number assigned to them. Indeed, and it makes me more cautious that claims for the _usefulness_ (rather than just consistency) of an ordered set are missing an XY problem. The only "natural" value of insertion-order ordering for a dynamic ordered set is that it supports FIFO queues. Well, that's one thing that a deque already excels at, but much more obviously, flexibly, and space- and time- efficiently. For dicts the motivating use cases were much more "static", like preserving textual order of keyword argument specifications, and avoiding gratuitous changing of order when round-tripping key-value encodings like much of JSON. So what problem(s) would a dynamic ordered set really be aiming at? I don't know. Consistency & predictability I understand and appreciate, but little beyond that. Even FIFO ordering is somewhat a PITA, since `next(iter(set))` is an overly elaborate way to need to spell "get the first" if a FIFO queue _is_ a prime dynamic use case. And there's no efficient way at all to access the other end (although I suppose set.pop() would - like dict.popitem() - change to give a _destructive_ way to get at "the other end"). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/J52ATOHFXNBSEJV6QQZBFXSC2TAHOWGW/ Code of Conduct: http://python.org/psf/codeofconduct/
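[Editor's sketch] The "overly elaborate" spellings being complained about, shown with an insertion-ordered dict standing in for a hypothetical ordered set (the job names are made up for illustration):

```python
# An insertion-ordered dict with None values, emulating an ordered set:
s = dict.fromkeys(["job1", "job2", "job3"])

# "Get the first" has no direct spelling; the elaborate one is:
first = next(iter(s))
assert first == "job1"

# The other end is reachable only destructively, via popitem() --
# analogous to the hypothetical destructive set.pop() discussed above:
last, _ = s.popitem()     # removes and returns the most recent insertion
assert last == "job3"
assert list(s) == ["job1", "job2"]
```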
[Python-Dev] Re: Should set objects maintain insertion order too?
[Nick] > I took Larry's request a slightly different way: Sorry, I was unclear: by "use case" I had in mind what appeared to me to be the overwhelming thrust of the _entirety_ of this thread so far, not Larry's original request. > he has a use case where he wants order preservation (so built in sets aren't > good), but combined with low cost duplicate identification and elimination and > removal of arbitrary elements (so lists and collections.deque aren't good). Sure. > Organising a work queue that way seems common enough to me to be > worthy of a stdlib solution that doesn't force you to either separate a > "job id" from the "job" object in an ordered dict, or else use an ordered > dict with "None" values. > > So while it means answering "No" to the "Should builtin sets preserve > order?" question (at least for now), adding collections.OrderedSet *would* > address that "duplicate free pending task queue" use case. Only Larry can answer whether that would meet his perceived need. My _guess_ is that he wouldn't know OrderedSet existed, and, even if he did, he'd use a dict with None values anyway (because it's less hassle and does everything he wanted). But I have to add that I don't know enough about his original use case to be sure it wasn't "an XY problem": https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem """ The XY problem is asking about your attempted solution rather than your actual problem. That is, you are trying to solve problem X, and you think solution Y would work, but instead of asking about X when you run into trouble, you ask about Y. """ Because I've never had a "job queue" problem where an OrderedSet would have helped. Can't guess what's different about Larry's problem. 
___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TV2XVX2NKUGJJZNFKO4TSMEPVQI6Y75Q/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Nick Coghlan ] > Starting with "collections.OrderedSet" seems like a reasonable idea, > though - that way "like a built-in set, but insertion order preserving" will > have an obvious and readily available answer, and it should also > make performance comparisons easier. Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it. The problem is that the "use case" here isn't really a matter of functionality, but of consistency: "it would be nice if", like dicts enjoy now, the iteration order of sets was defined in an implementation-independent way. That itch isn't scratched unless the builtin set type defines it. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PM5ENMLR665XG32AS2FEAEUVDG3AFWV6/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Larry] > "I don't care about performance" is not because I'm aching for Python to > run my code slowly. It's because I'm 100% confident that the Python > community will lovingly optimize the implementation. I'm not ;-) > So when I have my language designer hat on, I really don't concern myself > with performance. As I thought I said earlier in the thread, I think we > should > figure out the semantics we want first, and then we figure out how to make it > fast. Pretty much the opposite of how Python started. The original lists and dicts were deliberately designed to have dirt simple implementations, in reaction against ABC's "theoretically optimal" data structures that were a nightmare to maintain, and had so much overhead to support "optimality" in all cases that they were, in fact, much slower in common cases. There isn't magic waiting to be uncovered here: if you want O(1) deletion at arbitrary positions in an ordered sequence, a doubly linked list is _the_ way to do it. That's why, e.g., Raymond said he still uses a doubly linked list, instead of an ordered dict, in the LRU cache implementation. If that isn't clear, a cache can be maintained in least-to-most recently accessed order with an ordered dict like so: if key in cache: cached_value = cache.pop(key) # remove key else: compute cached_value assert key not in cache cache[key] = cached_value # most recently used at the end now return cached_value and deleting the least recently used is just "del cache[next(iter(cache))]" (off topic, just noting this is a fine use for the "first()" function that's the subject of a different thread). We _could_ structure an ordered dict's hash+key+value records as a doubly linked list instead (and avoid the need for O(N) reorganizations). But then we piss away much of the memory savings (to store the new links) that was the _point_ of compact dicts to begin with. So there was a compromise. 
No links, deletion on its own is O(1), but can periodically require O(N) under-the-covers table reorganizations to squash out the holes. "Suitable enough" for ordered dicts, but _thought_ to be unsuitable for ordered sets (which appear to have higher rates of mixing deletions with insertions - the LRU cache example being an exception). But there are also other optimizations in the current set implementation, so "fine, add the doubly linked list to sets but not to dicts" is only part of it. Which may or may not be possible to match, let alone beat, in an ordered set implementation. A practical barrier now is that Python is too mature to bank on loving optimizations _after_ a change to a core feature is released. It's going to need a highly polished implementation first. I certainly don't object to anyone trying, but it won't be me ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UXWGNLGC3BG2XSMYN5H73FVKP3O2XUUC/ Code of Conduct: http://python.org/psf/codeofconduct/
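[Editor's sketch] The cache recipe above, fleshed out as a runnable toy (the `make_lru` name and the closure structure are illustrative assumptions; this is not CPython's functools.lru_cache implementation):

```python
def make_lru(compute, maxsize):
    """Toy LRU cache on a plain insertion-ordered dict."""
    cache = {}  # least recently used key is always first in iteration order

    def lookup(key):
        if key in cache:
            value = cache.pop(key)        # remove key...
        else:
            value = compute(key)
            if len(cache) >= maxsize:
                # evict the least recently used: the first key
                del cache[next(iter(cache))]
        cache[key] = value                # ...reinsert: now most recent
        return value

    return lookup

calls = []
square = make_lru(lambda x: calls.append(x) or x * x, maxsize=2)
assert [square(1), square(2), square(1), square(3)] == [1, 4, 1, 9]
assert calls == [1, 2, 3]      # the second square(1) was a cache hit
square(2)                      # 2 was evicted when 3 arrived...
assert calls == [1, 2, 3, 2]   # ...so it gets recomputed
```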
[Python-Dev] Re: Should set objects maintain insertion order too?
... [Larry] >> One prominent Python core developer** wanted this feature for years, and I >> recall >> them saying something like: >> >> Guido says, "When a programmer iterates over a dictionary and they see the >> keys >> shift around when the dictionary changes, they learn something!" To that I >> say--"Yes! >> They learn that Python is unreliable and random!" [Tim] > I never wanted ordered dicts, but never objected to them either. All > in all, given that _I_ haven't seen a performance degradation, I like > that they're more predictable, and love the memory savings. That reads like I was correcting Larry for misquoting _me_. Sorry! He wasn't quoting me, and I didn't think he was. What he did quote was delightfully snarky enough that I didn't want to snip it, but not substantial enough to be worth the bother of addressing. So I just moved on to summarize what I said at the time (i.e., nothing). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LXJU3EKB2KM3BT6MTGDLWVHTZUUIZGJK/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Larry] > Didn't some paths also get slightly slower as a result of maintaining > insertion order when mixing insertions and deletions? I paid no attention at the time. But in going from "compact dict" to "ordered dict", deletion all by itself got marginally cheaper. The downside was the need to rearrange the whole dict when too many "holes" built up. "Compact (but unordered) dict" doesn't need that. > My recollection is that that was part of the debate--not only "are we going to > regret inflicting these semantics on posterity, and on other > implementations?", > but also "are these semantics worth the admittedly-small performance hit, in > Python's most important and most-used data structure?". There's a layer of indirection in compact dicts - lookups are distributed across two arrays. In non-compact unordered dicts, everything is in a single array. Cache effects may or may not make a measurable difference then, depending on all sorts of things. > Also, I don't recall anything about us resigning ourselves to explicitly > maintain ordering on dicts because people were relying on it, "battling > windmills", etc. Dict ordering had never been guaranteed, a lack > of guarantee Python had always taken particularly seriously. Rather, we > decided it was a desirable feature, and one worth pursuing even at the > cost of a small loss of performance. I'm not convinced there was a loss of performance. The delay between the implementation supplying ordered dicts, and the language guaranteeing it, was, I suspect, more a matter of getting extensive real-world experience about whether the possible new need to massively rearrange dict internals to remove "holes" would bite someone too savagely to live with. But, again, I wasn't paying attention at the time. 
> One prominent Python core developer** wanted this feature for years, and I recall
> them saying something like:
>
> Guido says, "When a programmer iterates over a dictionary and they see the keys
> shift around when the dictionary changes, they learn something!" To that I say--"Yes!
> They learn that Python is unreliable and random!"

I never wanted ordered dicts, but never objected to them either. All in all, given that _I_ haven't seen a performance degradation, I like that they're more predictable, and love the memory savings.

But as Raymond said (& I elaborated on), there's reason to believe that the implementation of ordered dicts is less suited to sets, where high rates of mixing adds and deletes are more common (thus triggering high rates of massive internal dict rearranging). Which is why I said much earlier that I'd be +1 on ordered sets only when there's an implementation that's as fast as what we have now.

Backing up:

> Python is the language where speed, correctness, and readability trump
> performance every time.

Speed trumping performance didn't make any sense ;-) So assuming you didn't intend to type "speed", I think you must have, say, Scheme or Haskell in mind there. "Practicality beats purity" is never seen on forums for those languages. Implementation speed & pain have played huge roles in many Python decisions. As someone who has wrestled with removing the GIL, you should know that better than anyone ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QK3KERIPYD3Q3XNKZJBQBQ6NUUKT63WN/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Tim]
>> If it's desired that "insertion order" be consistent across runs,
>> platforms, and releases, then what "insertion order" _means_ needs to
>> be rigorously defined & specified for all set operations. This was
>> comparatively trivial for dicts, because there are, e.g., no
>> commutative binary operators defined on dicts.

[Larry]
> My intuition is that figuring out sensible semantics here is maybe not
> trivial, but hardly impossible. If this proposal goes anywhere I'd be
> willing to contribute to figuring it out.

No, it _is_ easy. It's just tedious, adding piles of words to the docs, and every constraint also constrains possible implementations. You snipped my example of implementing union, which you should really think about instead ;-)

>> If you want to insist that `a | b` first list all the elements of a,
>> and then all the elements of b that weren't already in a, regardless
>> of cost, then you create another kind of unintuitive surprise: in
>> general the result of "a | b" will display differently than the result
>> of "b | a" (although the results will compare equal), and despite that
>> the _user_ didn't "insert" anything.

> Call me weird--and I won't disagree--but I find nothing unintuitive about that.
> After all, that's already the world we live in: there are any number of sets
> that compare as equal but display differently. In current Python:
>
> >>> a = {'a', 'b', 'c'}
> >>> d = {'d', 'e', 'f'}
> >>> a | d
> {'f', 'e', 'd', 'a', 'b', 'c'}
> >>> d | a
> {'f', 'b', 'd', 'a', 'e', 'c'}

Yup, it happens. But under the sample union implementation I gave, it would never happen when the sets had different cardinalities (the different sizes are used to force a "standard" order then). For mathematical sets, | is commutative (it makes no difference to the result if the arguments are swapped - but can make a _big_ difference to implementation performance unless the implementation is free to pick the better order).

> ...
> This is also true for dicts, in current Python, which of course do maintain > insertion order. Dicts don't have the | operator, so I approximate the > operation by duplicating the dict (which AFAIK preserves insertion order) Ya, it does, but I don't believe that's documented (it should be). > and using update. Too different to be interesting here - update() isn't commutative. For sets, union, intersection, and symmetric difference are commutative. > ... > Since dicts already behave in exactly that way, I don't think it would be too > surprising if sets behaved that way too. In fact, I think it's a little > surprising > that they behave differently, which I suppose was my whole thesis from > the beginning. I appreciate that dicts and sets behave differently in visible ways now. It just doesn't bother me enough that I'm cool with slowing set operations to "fix that". ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/L2VGSHROPMW4BV6ORYMFKQ4ZJ5AXTZLE/ Code of Conduct: http://python.org/psf/codeofconduct/
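[Editor's sketch] The copy-then-update approximation described above, spelled out; it also shows concretely why update() isn't commutative with respect to order:

```python
a = dict.fromkeys("abc")
d = dict.fromkeys("def")

# dict(x) copies x, preserving insertion order; update() then appends
# any previously unseen keys at the end.
union1 = dict(a); union1.update(d)
union2 = dict(d); union2.update(a)

assert union1 == union2                # equal as mappings...
assert list(union1) == list("abcdef")  # ...but ordered differently:
assert list(union2) == list("defabc")  # order depends on operand order
```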
[Python-Dev] Re: Should set objects maintain insertion order too?
[Petr Viktorin ]
> ...
> Originally, making dicts ordered was all about performance (or rather
> memory efficiency, which falls in the same bucket.) It wasn't added
> because it's better semantics-wise.

As I tried to flesh out a bit in a recent message, the original "compact dict" idea got all the memory savings, but did _not_ maintain insertion order. Maintaining insertion order too complicated deletions (see recent message), but was deliberately done because people really did want "ordered dicts".

> Here's one (very simplified and maybe biased) view of the history of dicts:
>
> * 2.x: Dicts are unordered, please don't rely on the order.
> * 3.3: Dict iteration order is now randomized. We told you not to rely on it!
> * 3.6: We now use an optimized implementation of dict that saves memory!
>   As a side effect it makes dicts ordered, but that's an implementation
>   detail, please don't rely on it.
> * 3.7: Oops, people are now relying on dicts being ordered. Insisting on
>   people not relying on it is battling windmills. Also, it's actually
>   useful sometimes, and alternate implementations can support it pretty
>   easily. Let's make it a language feature! (Later it turns out
>   MicroPython can't support it easily. Tough luck.)

A very nice summary! My only quibble is as above: the "compact dict" implementation doesn't maintain insertion order naturally, _unless_ there are no deletions (which is, e.g., true of dicts constructed to pass keyword arguments). The code got hairier to maintain insertion order in the presence of mixing insertions and deletions. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TRVOCOLQGSOM2OLLAI3UPRCTFIKIWWH6/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Should set objects maintain insertion order too?
[Raymond]
> ...
> * The ordering we have for dicts uses a hash table that indexes into a sequence.
> That works reasonably well for typical dict operations but is unsuitable for set
> operations where some common use cases make interspersed additions
> and deletions (that is why the LRU cache still uses a cheaply updated
> doubly-linked list rather than deleting and reinserting dict entries).

I'm going to take a stab at fleshing that out a bit: ideally, an ordered dict stores hash+key+value records in a contiguous array with no "holes". That's ("no holes") where the memory savings comes from. The "holes" are in the hash table, which only holds indices _into_ the contiguous array (and the smallest width of C integer indices sufficient to get to every slot in the array).

"The problem" is that deletions leave _two_ kinds of holes: one in the hash table, and another in the contiguous vector. The latter holes cannot be filled with subsequent new hash+key+value records because that would break insertion order.

So in an app that mixes additions and deletions, the ordered dict needs to be massively rearranged at times to squash out the holes left behind by deletions, effectively moving all the holes to "the end", where they can again be used to reflect insertion order.

Unordered dicts never had to be rearranged unless the total size changed "a lot", and that remains true of the set implementation. But in apps mixing adds and deletes, ordered dicts can need massive internal rearrangement at times even if the total size never changes by more than 1.

Algorithms doing a lot of mixing of adds and deletes seem a lot more common for sets than for dicts, and so the ordered dict's basic implementation _approach_ is a lot less suitable for sets. Or, at least, that's my best attempt to flesh out Raymond's telegraphic thinking there.

Note: the holes left by deletions _wouldn't_ be "a problem" _except_ for maintaining insertion order.
If we were only after the memory savings, then on deletion "the last" record in the contiguous array could be moved into the hole at once, leaving the array hole-free again. But that also changes the order. IIRC, that's what the original "compact dict" _did_ do. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QQXSSJWHKTUNHSMSHVM7XLMDBMUV7BDX/ Code of Conduct: http://python.org/psf/codeofconduct/
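[Editor's sketch] A toy model of the layout being described (pure illustration, nothing like the real C code; a Python dict stands in for the sparse index table):

```python
class ToyCompactDict:
    """Sparse index table + dense entries array, insertion-ordered."""

    def __init__(self):
        self.index = {}     # stand-in for the sparse hash table of indices
        self.entries = []   # dense (hash, key, value) records, in order

    def __setitem__(self, key, value):
        if key in self.index:
            self.entries[self.index[key]] = (hash(key), key, value)
        else:
            self.index[key] = len(self.entries)
            self.entries.append((hash(key), key, value))

    def __delitem__(self, key):
        i = self.index.pop(key)
        self.entries[i] = None   # a hole: order forbids back-filling it

    def rebuild(self):
        # The periodic O(N) squash: keep live entries, renumber the index.
        self.entries = [e for e in self.entries if e is not None]
        self.index = {e[1]: i for i, e in enumerate(self.entries)}

    def keys(self):
        return [e[1] for e in self.entries if e is not None]

d = ToyCompactDict()
for k in "abc":
    d[k] = None
del d["b"]
d["d"] = None
assert d.keys() == ["a", "c", "d"]   # insertion order preserved...
assert len(d.entries) == 4           # ...at the cost of a hole
d.rebuild()
assert len(d.entries) == 3           # the rebuild squashes it out
```

Back-filling the hole with "the last" entry on deletion would keep the array dense with no rebuild, which is exactly the order-breaking shortcut described above.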
[Python-Dev] Re: Should set objects maintain insertion order too?
[Tim]
> BTW, what should
>
>     {1, 2} | {3, 4, 5, 6, 7}
>
> return as ordered sets? Beats me ;-)

[Larry]
> The obvious answer is {1, 2, 3, 4, 5, 6, 7}.

Why? An obvious implementation that doesn't ignore performance entirely is:

    def union(smaller, larger):
        if len(larger) < len(smaller):
            smaller, larger = larger, smaller
        result = larger.copy()
        for x in smaller:
            result.add(x)
        return result

In the example, that would first copy {3, 4, 5, 6, 7}, and then add 1 and 2 (in that order) to it, giving {3, 4, 5, 6, 7, 1, 2} as the result.

If it's desired that "insertion order" be consistent across runs, platforms, and releases, then what "insertion order" _means_ needs to be rigorously defined & specified for all set operations. This was comparatively trivial for dicts, because there are, e.g., no commutative binary operators defined on dicts.

If you want to insist that `a | b` first list all the elements of a, and then all the elements of b that weren't already in a, regardless of cost, then you create another kind of unintuitive surprise: in general the result of "a | b" will display differently than the result of "b | a" (although the results will compare equal), and despite that the _user_ didn't "insert" anything. ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OU37ZU46BCHI6HLA7E3NEWCDOLQOHRNF/ Code of Conduct: http://python.org/psf/codeofconduct/
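[Editor's sketch] The size-based union above, simulated with insertion-ordered dicts standing in for ordered sets (builtin sets expose no defined order, so a dict is the nearest runnable stand-in):

```python
def union(a, b):
    """Iterate over the smaller operand, copy the larger -- the
    performance-minded order, regardless of left/right position."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    result = dict(larger)        # copy the larger operand...
    for x in smaller:            # ...then add the smaller's elements
        result[x] = None
    return result

a = dict.fromkeys([1, 2])
b = dict.fromkeys([3, 4, 5, 6, 7])

# Both spellings compare equal, and -- because operand sizes differ --
# both display the same "standard" order: larger operand first.
assert union(a, b) == union(b, a)
assert list(union(a, b)) == [3, 4, 5, 6, 7, 1, 2]
assert list(union(b, a)) == [3, 4, 5, 6, 7, 1, 2]
```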
[Python-Dev] Re: Should set objects maintain insertion order too?
[Guido]
> ...
> the language should not disappoint them, optimization opportunities be damned.

I would like to distinguish between two kinds of "optimization opportunities": theoretical ones that may or may not be exploited some day, and those that CPython has _already_ exploited. That is, we don't have a blank slate here. As Raymond said, the set implementation already diverged from the dict implementation in fundamental ways for "go fast" reasons.

How much are people willing to see set operations slow down to get this new feature? For me, "none" ;-) Really. I have no particular use for "ordered" sets, but have set-slinging code that benefits _today_ from the "go fast" work Raymond did for them.

Analogy: it was always obvious that list.sort() is "better" stable than not, but I ended up crafting a non-stable samplesort variant that ran faster than any stable sort anyone tried to write. For years. So we stuck with that, to avoid slowing down sorting across releases. The stable "timsort" finally managed to be _as_ fast as the older samplesort in almost all cases, and was very much faster in many real-world cases. "Goes faster" was the thing that really sold it, so much so that its increased memory use didn't count for much against it in comparison.

Kinda similarly, "ordered dicts" were (as has already been pointed out) originally introduced as a memory optimization ("compact" dicts), where the "ordered" part fell out more-or-less for free. The same approach wouldn't save memory for sets (or so I'm told), so the original motivation for compact dicts doesn't carry over.

So I'm +1 on ordered sets if and only if there's an implementation that's at least as fast as what we already have. If not now, then, like ordered dicts evolved, offer a slower OrderedSet type in the `collections` module for those who really want it, and wait for magic ;-)

BTW, what should

    {1, 2} | {3, 4, 5, 6, 7}

return as ordered sets?
Beats me. The imbalance between the operands' cardinalities can be arbitrarily large, and "first make a copy of the larger one, then loop over the smaller one" is the obvious way to implement union to minimize the number of lookups needed. The speed penalty for, e.g., considering "the left" operand's elements' insertion orders to precede all "the right" operand's elements' insertion orders can be arbitrarily large.

The concept of "insertion order" just doesn't make well-defined sense to me for any operation that produces a set from multiple input sets, unless it means no more than "whatever order happens to be used internally to add elements to the result set". Dicts don't have anything like that, although dict.update comes close, but in that case the result is defined by mutating the dict via a one-at-a-time loop over the argument dict.

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YXSGUPZYTO7TKOVVU32276M54TMITVVQ/ Code of Conduct: http://python.org/psf/codeofconduct/
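[Not from the thread - a quick toy illustration of the size-based union above, faking an insertion-ordered set with a dict (which, unlike sets, does guarantee insertion order as of CPython 3.7); the helper names are invented here:]

```python
# Fake an insertion-ordered set with a dict whose values are ignored.
def oset(iterable):
    return dict.fromkeys(iterable)

def union(a, b):
    """Copy-the-larger, add-the-smaller, as sketched in the message."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    result = dict(larger)
    for x in smaller:
        result[x] = None
    return result

a = oset([1, 2])
b = oset([3, 4, 5, 6, 7])
print(list(union(a, b)))  # [3, 4, 5, 6, 7, 1, 2]
print(list(union(b, a)))  # also [3, 4, 5, 6, 7, 1, 2] - this scheme is
# symmetric, but an "all of a's elements first" rule would instead give
# [1, 2, 3, 4, 5, 6, 7] for a | b and [3, 4, 5, 6, 7, 1, 2] for b | a:
# equal sets that display differently.
```

Under the size-based rule the displayed order depends on the operands' relative sizes rather than on which side of `|` they appear - which is the "whatever order happens to be used internally" outcome the message describes.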
[Python-Dev] Re: Should set objects maintain insertion order too?
[Larry Hastings ]
> As of 3.7, dict objects are guaranteed to maintain insertion order. But set
> objects make no such guarantee, and AFAIK in practice they don't maintain
> insertion order either. If they ever appear to, it's an accident you shouldn't rely on.
> Should they?

From Raymond, 22 Dec 2017:

https://twitter.com/raymondh/status/944454031870607360

"""
Sets use a different algorithm that isn't as amendable to retaining insertion order. Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.
"""

Which is more an answer to "will they?" than "should they?" ;-)

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WMIZRFJZXI3CD5YMLOC5Z3LXVXD7HM4R/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Macros instead of inline functions?
[Skip Montanaro ]
> ...
> I don't think stable code which uses macros should be changed (though
> I see the INCREF/DECREF macros just call private inline functions, so
> some conversion has clearly been done). Still, in new code, shouldn't
> the use of macros for more than trivial use cases (constant defs,
> simple one-liners) be discouraged at this point?

Within reason, I think so. Like C's old "register" declaration, compilers will eventually evolve to make better decisions about what should be done than humans generally do.

But there are macros that exist just to reduce repetitive, error-prone typing, and others that set up control flow (like the trashcan macros, or listobject.c's IFLT). Which is the "within reason" part above. There are still fine uses for macros in C. It's just that faking inline functions is no longer one of them ;-)

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KGU63DELU4YJTVFALCP55SXQIJ2QN5WJ/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: Compacting the Uncompactable
Short course: a replacement for malloc for use in contexts that can't "move memory" after an address is passed out, but want/need the benefits of compactification anyway.

Key idea: if the allocator dedicates each OS page to requests of a specific class, then consider two pages devoted to the same class, where `-` is free space, and X and Y are allocated chunks:

    page1: X--XX-X---X
    page2: --Y--Y--YY-

Because there's no overlap in the offsets of allocated blocks here, we can copy the 4 Ys into page1 (or the Xs into page2). Then tell the OS to change its page tables so that the virtual addresses of page1 and page2 _both_ map to the physical page page1 referred to. page2's physical page can be returned to the OS then. No _virtual_ address malloc ever handed out needs to change.

The rest is making this all safe & "go fast".

On Sat, Sep 28, 2019 at 2:06 PM MRAB wrote: > > Here's a video about memory fragmentation and compaction that you might > find interesting: > > "Compacting the Uncompactable" by Bobby Powers > https://www.youtube.com/watch?v=c1UBJbfR-H0 > ___ > Python-Dev mailing list -- python-dev@python.org > To unsubscribe send an email to python-dev-le...@python.org > https://mail.python.org/mailman3/lists/python-dev.python.org/ > Message archived at > https://mail.python.org/archives/list/python-dev@python.org/message/G6OR45TETKIFZDVWAK5ZGLLFTIC422TG/ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VMP5RXVQV574SXNJC5SNIJNAL3PYZZC6/
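[Not from the talk - a toy sketch, with invented names, of the "no overlap in offsets" test at the heart of the meshing trick, treating each page's occupancy as a bitmask:]

```python
def occupancy(page_diagram):
    """Bitmask with bit i set iff offset i in the page is allocated."""
    m = 0
    for i, c in enumerate(page_diagram):
        if c != "-":
            m |= 1 << i
    return m

def can_mesh(mask1, mask2):
    """Two same-class pages can be meshed iff no offset is
    allocated in both, i.e. their occupancy masks are disjoint."""
    return mask1 & mask2 == 0

def mesh(mask1, mask2):
    """Occupancy of the surviving page after copying one page's
    live chunks into the other's free slots."""
    assert can_mesh(mask1, mask2)
    return mask1 | mask2

page1 = occupancy("X--XX-X---X")
page2 = occupancy("--Y--Y--YY-")
assert can_mesh(page1, page2)  # the Xs and Ys never share an offset
combined = mesh(page1, page2)  # 5 Xs + 4 Ys live in one physical page
```

The real work, of course, is everything this ignores: finding meshable page pairs cheaply, doing the copies and page-table remapping safely while other threads run, and so on.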
[Python-Dev] Re: Spammy stuff (was Re: congrats on 3.5! Alas, windows 7 users are having problems installing it)
[Tim] > While python-dev has several "official" moderators, best I can tell > I'm the only one who has reviewed these messages for years. I should clarify that! That's not meant to be a dig at the other moderators. I review everything because I'm retired and am near the computer many hours almost every day. That's been so for at least a decade. So the others rarely get a chance - the queue is very likely empty if they ever look. After a few years of that, they may well stop looking ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZZXYE2PPVEURQ23CEOFYS5YPBJR5BDDT/
[Python-Dev] Re: Spammy stuff (was Re: congrats on 3.5! Alas, windows 7 users are having problems installing it)
[This is about the mailing list, not about Python development]

That python-dev-owner has gotten two complaints about this message so far suggests I should explain what's going on ;-)

New list members are automatically moderated. Their posts sit in a moderation queue waiting for moderator action. While python-dev has several "official" moderators, best I can tell I'm the only one who has reviewed these messages for years.

So - yup! - I approved this spam. The Subject was legit, the email address was from Google rather than, e.g., some obscure Russian host, and the link was to a site with an obviously computer-related name. "Good enough". Usually, but not in this case. So, sorry, but sometimes one of these just will slip through. I've since fiddled the list to auto-discard future posts from this address, but have no ability to purge the spam from the list archives.

There are other not-spam but "low quality" messages I've let through too, since the list migrated to Mailman 3. That's because we lost some very useful moderator functionality (which I've griped about on a Mailman 3 issue tracker):

- It's no longer possible to forward a message from the queue to a different address. So, e.g., rather than throw them away, I've let through some newbie messages that I would previously have forwarded to the Python help list.

- It's no longer possible to write custom text for a "Reject" action. So, again, rather than throw away messages that should obviously be posted to a different list, with the author getting nothing but a generic explanation-free "rejected" message, I've let them through.

Yes, I _could_ use my own email account and copy/paste/fiddle to forward/reply in such cases, independent of Mailman functionality. And sometimes I do. But there's only so much time I can give to this, and some days I settle for clearing the queue just as quickly as Tim-ly possible.
Some of that will get better if/when Mailman 3 restores lost functionality, but I doubt anything can be done that stops spam 100% short of erring on the side of rejecting marginally legit messages too. But that's not how my brain is wired, so you would need a different moderator if that's what you want. In the meantime, I'll continue cashing my lavish moderator checks and just laugh at complaints ;-) On Sun, Sep 15, 2019 at 12:21 AM wrote: > > Check this article here which will help you out. > [useless link] ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3DPJTBS252J3XX5FMFRXTOFWYIUDEO6H/
[Python-Dev] Re: PEP 572 TargetScopeError
[Guido] > I don't see how this debate can avoid a vote in the Steering Council. FWIW, I found Nick's last post wholly persuasive: back off to SyntaxError for now, and think about adding a more specific exception later for _all_ cases (not just walrus) in which a scope conflict isn't allowed (e.g., his example of declaring a function's formal argument `global` in the function). ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7ILVNCFSYMJ4HP5INVA34N4DVEKSEYAU/
[Python-Dev] Re: PEP 572 TargetScopeError
[Barry Warsaw ] > bpo-37757: https://bugs.python.org/issue37757 Really couldn't care less whether it's TargetScopeError or SyntaxError, but don't understand the only rationale given here for preferring the latter: > To me, “TargetScopeError” is pretty obscure and doesn’t give users an > obvious clue as to what’s going wrong, without searching the interwebs. Whereas SyntaxError would give no clue whatsoever, and nothing useful to search for. In contrast, a search for TargetScopeError would presumably find a precisely relevant explanation as the top hit (indeed, it already does today). I don't care because it's unlikely an error most people will make at all - or, for those too clever for their own good, make more than once ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WBWCV7PBDYY6PVSXIX2QX7GKXY54PREU/
[Python-Dev] Re: Can someone please finish merging GH-13482?
[Brett Cannon ]
> We probably need to update https://devguide.python.org/committing/ to
> have a step-by-step list of how to make a merge works and how to
> handle backports instead of the wall of text that we have. (It's already
> outdated anyway, e.g. `Misc/ACKS` really isn't important as git itself
> records the author of the commit and so that can be used with
> `Misc/ACKS` for old commits to gather the list of folks who have
> contributed.)

Don't put too much weight on my screwups ;-) I was appalled to hear that the OP's contribution was still sitting unmerged, and was in a hurry to resolve that at a time I _had_ no significant time to give to it. Mariatta and Terry Reedy finished up what I left undone, so in the end it's all good :-)

And there's a problem with the GitHub workflow docs that may be unique to me: we have helpful layers of automation, but they're shortening a git workflow I don't understand even without the automation. With the automation, I'm just doubly clueless.

That's because I'm old. My capacity to give a rip about source control system quirks was apparently entirely used up when I spent several weeks mastering the intricacies of Mercurial. Try as I might, I just haven't been able to force myself to become competent with git. It's not that I disapprove of git! It's apparently more that we're each born with a finite capacity for being _able_ to learn Yet Another New Source Control System, and mine was used up on YANSCS #8 ;-)

aging-isn't-for-the-optimistic-ly y'rs - tim

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ACAJVCPHE2N5RNI4BNL2JK6INMJRPZ2N/
[Python-Dev] Re: Can someone please finish merging GH-13482?
[Mariatta ]
>> - Since this is a 1st-time contributor, does it need a change to the ACKS file?
>
> I think the change is trivial enough, the misc/acks is not necessary.
>
>> - Anything else?
>
> 1. Does it need to be backported? If so, please add the "needs backport to
> .." label.
>
> 2. Add the "🤖 automerge" label. The bot will take care of merging. It
> will use the PR title and description as commit message.
> If the PR title/description is not good enough as commit message, edit it
> first before adding the "🤖 automerge" label.
>
>> So I'll watch and do it myself next time ;-)
>
> Hopefully the above instructions are clear and simple enough, so I'll let
> you trigger the automerge this time.

So I tried ;-) No idea what did and didn't work. I got several of these messages:

"""
Sorry, I can't merge this PR. Reason: Base branch was modified. Review and try the merge again..
"""

I'm guessing (don't know) those are about failed auto-backports. I don't have time now to try to figure it all out - it seemed to have spawned a number of new GH PRs for the backports, and I'm thoroughly lost now :-(

The change did show up when I pulled from upstream master, so I expect that part did work.

___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SHIDYDSL43D6ANHGBCGVPMHSGCU5XK2P/
[Python-Dev] Can someone please finish merging GH-13482?
https://github.com/python/cpython/pull/13482 is a simple doc change for difflib, which I approved some months ago. But I don't know the current workflow well enough to finish it myself. Like: - Does something special need to be done for doc changes? - Since this is a 1st-time contributor, does it need a change to the ACKS file? - Anything else? I'm sure this is all obvious after you've done it once, but I haven't. So I'll watch and do it myself next time ;-) ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TFTQCJ7IZNCE3XUWEC4IVHPXBDZIKVEI/