> I don't think the optional existence of column number information needs a different kind of pyc file. Just a flag in a pyc file's header at most. It isn't a new type of file.
Greg, what do you think if, instead of not writing it to the pyc file with -OO or adding a header entry to decide whether to read/write it, we place None in the field? That way we can leverage the option that we intend to add to deactivate displaying the new traceback information to also reduce the data in the pyc files. The only problem is that there will still be a tiny bit of overhead: an extra object per code object (None), but that's much, much better than something that scales with the number of instructions :)

What's your opinion on this?

On Sat, 8 May 2021 at 21:45, Gregory P. Smith <g...@krypto.org> wrote:

> On Sat, May 8, 2021 at 1:32 PM Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>
>> > We can't piggy back on -OO as the only way to disable this, it needs
>> > to have an option of its own. -OO is unusable as code that relies on
>> > "doc"strings as application data such as
>> > http://www.dabeaz.com/ply/ply.html exists.
>>
>> -OO is the only sensible way to disable the data. There are two things to
>> disable:
>
> nit: I wouldn't choose the word "sensible" given that -OO is already
> fundamentally unusable without knowing if any code in your entire
> transitive dependencies might depend on the presence of docstrings...
>
>> * The data in pyc files
>> * Printing the exception highlighting
>>
>> Printing the exception highlighting can be disabled via combo of
>> environment variable / -X option but collecting the data can only be
>> disabled by -OO. The reason is that this will end in pyc files
>> so when the data is not there, a different kind of pyc files need to be
>> produced and I really don't want to have another set of pyc file extension
>> just to deactivate this. Notice that also a configure
>> time variable won't work because it will cause crashes when reading pyc
>> files produced by the interpreter compiled without the flag.
>
> I don't think the optional existence of column number information needs a
> different kind of pyc file.
> Just a flag in a pyc file's header at most.
> It isn't a new type of file.
>
>> On Sat, 8 May 2021 at 21:13, Gregory P. Smith <g...@krypto.org> wrote:
>>
>>> On Sat, May 8, 2021 at 11:58 AM Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>>>
>>>> Hi Brett,
>>>>
>>>>> Just to be clear, .pyo files have not existed for a while:
>>>>> https://www.python.org/dev/peps/pep-0488/.
>>>>
>>>> Whoops, my bad, I wanted to refer to the pyc files that are generated
>>>> with -OO, which have the "opt-2" prefix.
>>>>
>>>>> This only kicks in at the -OO level.
>>>>
>>>> I will correct the PEP so it reflects this more exactly.
>>>>
>>>>> I personally prefer the idea of dropping the data with -OO since if
>>>>> you're stripping out docstrings you're already hurting introspection
>>>>> capabilities in the name of memory. Or one could go as far as to introduce
>>>>> -Os to do -OO plus dropping this extra data.
>>>>
>>>> This is indeed the plan, sorry for the confusion. The opt-out mechanism
>>>> is using -OO, precisely as we are already dropping other data.
>>>
>>> We can't piggy back on -OO as the only way to disable this, it needs to
>>> have an option of its own. -OO is unusable as code that relies on
>>> "doc"strings as application data such as
>>> http://www.dabeaz.com/ply/ply.html exists.
>>>
>>> -gps
>>>
>>>> Thanks for the clarifications!
>>>>
>>>> On Sat, 8 May 2021 at 19:41, Brett Cannon <br...@python.org> wrote:
>>>>
>>>>> On Fri, May 7, 2021 at 7:31 PM Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>>>>>
>>>>>> Although we were originally not sympathetic with it, we may need to
>>>>>> offer an opt-out mechanism for those users that care about the impact of
>>>>>> the overhead of the new data in pyc files
>>>>>> and in in-memory code objects, as was suggested by some folks (Thomas,
>>>>>> Yury, and others).
>>>>>> For this, we could propose that the functionality will
>>>>>> be deactivated along with the extra
>>>>>> information when Python is executed in optimized mode (``python -O``)
>>>>>> and therefore pyo files will not have the overhead associated with the
>>>>>> extra required data.
>>>>>
>>>>> Just to be clear, .pyo files have not existed for a while:
>>>>> https://www.python.org/dev/peps/pep-0488/.
>>>>>
>>>>>> Notice that Python
>>>>>> already strips docstrings in this mode so it would be "aligned" with
>>>>>> the current mechanism of optimized mode.
>>>>>
>>>>> This only kicks in at the -OO level.
>>>>>
>>>>>> Although this complicates the implementation, it certainly is still
>>>>>> much easier than dealing with compression (and more useful for those that
>>>>>> don't want the feature). Notice that we also
>>>>>> expect pessimistic results from compression as offsets would be quite
>>>>>> random (although predominantly in the range 10 - 120).
>>>>>
>>>>> I personally prefer the idea of dropping the data with -OO since if
>>>>> you're stripping out docstrings you're already hurting introspection
>>>>> capabilities in the name of memory. Or one could go as far as to introduce
>>>>> -Os to do -OO plus dropping this extra data.
>>>>>
>>>>> As for .pyc file size, I personally wouldn't worry about it. If
>>>>> someone is that space-constrained they either aren't using .pyc files or
>>>>> are only shipping a single set of .pyc files under -OO and skipping source
>>>>> code. And .pyc files are an implementation detail of CPython so there
>>>>> shouldn't be too much of a concern for other interpreters.
>>>>>
>>>>> -Brett
>>>>>
>>>>>> On Sat, 8 May 2021 at 01:56, Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>>>>>>
>>>>>>> One last note for clarity: that's the increase of size in the
>>>>>>> stdlib, the increase of size
>>>>>>> for pyc files goes from 28.471296MB to 34.750464MB, which is an
>>>>>>> increase of 22%.
>>>>>>>
>>>>>>> On Sat, 8 May 2021 at 01:43, Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Some update on the numbers. We have made some draft implementation
>>>>>>>> to corroborate the
>>>>>>>> numbers with some more realistic tests and it seems that our original
>>>>>>>> calculations were wrong.
>>>>>>>> The actual increase in size is quite a bit bigger than previously
>>>>>>>> advertised:
>>>>>>>>
>>>>>>>> Using bytes object to encode the final object and marshalling that
>>>>>>>> to disk (so using uint8_t) as the underlying
>>>>>>>> type:
>>>>>>>>
>>>>>>>> BEFORE:
>>>>>>>>
>>>>>>>> ❯ ./python -m compileall -r 1000 Lib > /dev/null
>>>>>>>> ❯ du -h Lib -c --max-depth=0
>>>>>>>> 70M     Lib
>>>>>>>> 70M     total
>>>>>>>>
>>>>>>>> AFTER:
>>>>>>>> ❯ ./python -m compileall -r 1000 Lib > /dev/null
>>>>>>>> ❯ du -h Lib -c --max-depth=0
>>>>>>>> 76M     Lib
>>>>>>>> 76M     total
>>>>>>>>
>>>>>>>> So that's an increase of 8.56 % over the original value. This is
>>>>>>>> storing the start offset and end offset with no compression
>>>>>>>> whatsoever.
>>>>>>>>
>>>>>>>> On Fri, 7 May 2021 at 22:45, Pablo Galindo Salgado <pablog...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> We are preparing a PEP and we would like to start some early
>>>>>>>>> discussion about one of the main aspects of the PEP.
>>>>>>>>>
>>>>>>>>> The work we are preparing is to allow the interpreter to produce
>>>>>>>>> more fine-grained error messages, pointing to
>>>>>>>>> the source associated to the instructions that are failing.
>>>>>>>>> For example:
>>>>>>>>>
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "test.py", line 14, in <module>
>>>>>>>>>     lel3(x)
>>>>>>>>>     ^^^^^^^
>>>>>>>>>   File "test.py", line 12, in lel3
>>>>>>>>>     return lel2(x) / 23
>>>>>>>>>            ^^^^^^^
>>>>>>>>>   File "test.py", line 9, in lel2
>>>>>>>>>     return 25 + lel(x) + lel(x)
>>>>>>>>>                 ^^^^^^
>>>>>>>>>   File "test.py", line 6, in lel
>>>>>>>>>     return 1 + foo(a,b,c=x['z']['x']['y']['z']['y'], d=e)
>>>>>>>>>                          ^^^^^^^^^^^^^^^^^^^^^
>>>>>>>>> TypeError: 'NoneType' object is not subscriptable
>>>>>>>>>
>>>>>>>>> The cost of this is having the start column number and end
>>>>>>>>> column number information for every bytecode instruction
>>>>>>>>> and this is what we want to discuss (there is also some stack cost
>>>>>>>>> to re-raise exceptions but that's not a big problem in
>>>>>>>>> any case). Given that column numbers are not very big compared
>>>>>>>>> with line numbers, we plan to store these as unsigned chars
>>>>>>>>> or unsigned shorts. We ran some experiments over the standard
>>>>>>>>> library and we found that the overhead of all pyc files is:
>>>>>>>>>
>>>>>>>>> * If we use shorts, the total overhead is ~3% (total size 28MB and
>>>>>>>>> the extra size is 0.88 MB).
>>>>>>>>> * If we use chars, the total overhead is ~1.5% (total size 28 MB
>>>>>>>>> and the extra size is 0.44MB).
>>>>>>>>>
>>>>>>>>> One of the disadvantages of using chars is that we can only report
>>>>>>>>> columns from 1 to 255 so if an error happens in a column
>>>>>>>>> bigger than that then we would have to exclude it (and not show
>>>>>>>>> the highlighting) for that frame. Unsigned short will allow
>>>>>>>>> the values to go from 0 to 65535.
>>>>>>>>>
>>>>>>>>> Unfortunately these numbers are not easily compressible, as every
>>>>>>>>> instruction would have very different offsets.
>>>>>>>>>
>>>>>>>>> There is also the possibility of not doing this based on some
>>>>>>>>> build flag or when using -O to allow users to opt out, but given the
>>>>>>>>> fact that these numbers can be quite useful to other tools like
>>>>>>>>> coverage measuring tools, tracers, profilers and the such, adding
>>>>>>>>> conditional logic to many places would complicate the implementation
>>>>>>>>> considerably and will potentially reduce the usability of those tools
>>>>>>>>> so we prefer not to have the conditional logic. We believe this
>>>>>>>>> extra cost is very much worth the better error reporting but we
>>>>>>>>> understand and respect other points of view.
>>>>>>>>>
>>>>>>>>> Does anyone see a better way to encode this information **without
>>>>>>>>> complicating a lot the implementation**? What are people's thoughts on
>>>>>>>>> the feature?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> Regards from cloudy London,
>>>>>>>>> Pablo Galindo Salgado
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>> Python-Dev mailing list -- python-dev@python.org
>>>>>> To unsubscribe send an email to python-dev-le...@python.org
>>>>>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>>>>>> Message archived at
>>>>>> https://mail.python.org/archives/list/python-dev@python.org/message/JUXUC7TYPAMB4EKW6HJL77ORDYQRJEFG/
>>>>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>>>>
>>>>> _______________________________________________
>>>> Python-Dev mailing list -- python-dev@python.org
>>>> To unsubscribe send an email to python-dev-le...@python.org
>>>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>>>> Message archived at
>>>> https://mail.python.org/archives/list/python-dev@python.org/message/PDWYJ55Z4XH6OHUQ7IDEG23GWIP6GJOT/
>>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>>
>>>
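
For anyone who wants to see the size trade-off behind the "place None in the field" idea in concrete terms, here is a minimal sketch. It is not CPython's actual encoding: the helper name `encode_columns`, the use of 0 as a "no column info" sentinel, and the sample offsets are all assumptions made up for illustration. It only shows that a uint8 table grows with the number of instructions, while a None placeholder marshals to a constant size.

```python
import marshal


def encode_columns(spans):
    """Pack one (start_col, end_col) pair per instruction as two uint8 values.

    Hypothetical helper. Columns outside 1..255 cannot be represented in a
    single byte, so they are stored as 0, which this sketch treats as
    "no column information for this instruction".
    """
    out = bytearray()
    for start, end in spans:
        if 1 <= start <= 255 and 1 <= end <= 255:
            out += bytes((start, end))
        else:
            out += b"\x00\x00"
    return bytes(out)


# Made-up offsets, one pair per bytecode instruction.
spans = [(5, 12), (16, 22), (300, 310)]

table = encode_columns(spans)
with_table = marshal.dumps(table)    # grows by 2 bytes per instruction
without_table = marshal.dumps(None)  # constant size, independent of the code

print("table bytes:", len(table))
print("marshalled with table:", len(with_table))
print("marshalled with None:", len(without_table))
```

The third pair falls outside the uint8 range, so it is stored as the sentinel and would get no highlighting, which mirrors the limitation described for unsigned chars in the quoted message.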
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/TWJ3WSTC3OIWV2RNUUZS5R6IQ7GXMHNN/
Code of Conduct: http://python.org/psf/codeofconduct/