Will _anything_ be able to truly parse and understand perl?

Adam Kennedy Wed, 24 Nov 2004 17:10:36 -0800

Hi folks

I thought it was about time I brought some concerns I've been having lately to the list. Not so much about any particular problem with perl6, but about problems with perl5 that we would seem to have the opportunity to fix but aren't (so far as I can tell).

One of the biggest problems I have had with perl5 is that nothing, not even perl itself, can truly "parse" Perl source. By this, I mean "parse" in the sense of reading a chunk of bytes of Perl source and understanding what they mean. Call it "document parsing" if you wish.

perl itself would also appear unable to understand perl source, instead doing what I would call RIBRIB parsing, "Read a bit, run a bit". The "parsing" of perl source is itself merely part of the first execution phase (BEGIN).
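The "read a bit, run a bit" behaviour can be seen with a string eval. This is only a minimal sketch; the variable name $begin_ran and the deliberately broken second line are made up for illustration. The point is that the BEGIN block has already executed by the time perl discovers that the source does not compile.

```perl
use strict;
use warnings;

our $begin_ran = 0;

# Two lines of source: a BEGIN block, then a deliberate syntax error.
my $source = <<'END_SOURCE';
BEGIN { $main::begin_ran = 1 }
my $x = ;
END_SOURCE

# Ask perl to compile the source. The eval returns undef because of the
# syntax error on the second line.
my $ok = eval $source;

print "compiled cleanly: ", ( defined $ok ? "yes" : "no" ), "\n";
print "BEGIN block ran:  ", ( $begin_ran  ? "yes" : "no" ), "\n";
```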

If we leave source filters out of it for now, the main problems with document parsing in the current Perl are caused by the interaction of prototypes and operator/operand context.

As the most common example, in order to know what the slash "/" character is (division or regex), you have to know whether you are in operator or operand context, which requires knowing which things are parameters for subroutines and which aren't, and you can't keep track of that without tracking all of the prototypes for every function both CORE and in the symbol table as you go, and you can't do THAT without loading every single module dependency and running a parse/BEGIN-phase-execution on all of the files, and you can't do THAT without having a perl interpreter to execute it all in.
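The slash problem can be sketched by compiling one and the same line twice, each time after a different declaration of a sub. The sub name "whatever", the helper parse_after and the package names CaseA/CaseB are all hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# The line under test. What it means depends entirely on how "whatever"
# was declared before the parser reaches it.
my $line = 'my $r = whatever / 25 ; # / ; $r = "regex";';

# Compile $line in its own package, after the given declaration, and
# return the final value of $r.
sub parse_after {
    my ( $package, $declaration ) = @_;
    local $_ = '';    # the regex reading matches against $_
    my $result = eval "package $package;\n$declaration\n$line\n\$r;";
    die $@ if $@;
    return $result;
}

# Empty prototype: "whatever" takes no arguments, so "/" is division
# and everything after "#" is a comment.
my $division = parse_after( 'CaseA', 'sub whatever() { 100 }' );

# No prototype: "whatever" is a list operator, so "/ 25 ; # /" is a
# regex passed as its argument and the assignment after it is live code.
my $regex = parse_after( 'CaseB', 'sub whatever { scalar @_ }' );

print "with empty prototype: $division\n";
print "with no prototype:    $regex\n";
```

A tool that only sees the text of $line has no way of choosing between the two readings without knowing the declaration.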

Any attempt to use the perl interpreter to "parse" code in order to understand it is both unpredictable and dangerous: you commonly don't have a platform that can fully run the code (and all its dependencies), and executing it can have all sorts of potentially dangerous side-effects.

If you can't load and BEGIN-phase-execute every single one of the dependencies, you can't parse. At all... ever!

use Win32::Something;
1;

Unparsable on Unix...

use Win32::Something;
use Proc::ProcessTable;
1;

Unparsable on anything... even if I just want "1" syntax highlighted as a number.

BEGIN { system 'rm -rf /'; }
1;

eeep!

If anyone checks a broken version of a module into CVS that is part of some large project you are working on, sorry, you can't parse anything any more, even to try to hunt down the problem.

For a more comprehensive example, take a look at Acme::BadExample, which uses absolutely plain and simple syntax, yet is completely unparsable. (The reward is as-yet unclaimed)

Go have a look now, I'll wait...

All of this creates HUGE headaches if we want to start adding some intelligence to code analysis and manipulation. We've all seen the lengths Komodo has had to go to by continuously running the code.

What few attempts there have been to modify code are fairly impotent. While you _can_ get source back from B:: it isn't particularly useful except as a way to serialise anonymous code for storage or transport.

Given $source, B:: throws away all syntax and commenting and POD and __DATA__ and whatever is after __END__ and then dumps back out something quite different from what went in. A sort of Frontpage effect. It's what the program thinks is close enough to be the same, but definitely not what you wanted.
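As a rough illustration of that effect, here is a sketch using B::Deparse, which I am assuming is the B:: module in question. The sub "area" and its comments are invented for the example.

```perl
use strict;
use warnings;
use B::Deparse;

# A hypothetical sub with comments and hand-chosen layout.
sub area {
    my ( $width, $height ) = @_;    # both in metres

    # Multiply the sides.
    return $width * $height;
}

# Ask B::Deparse to regenerate source text from the compiled sub.
my $regenerated = B::Deparse->new->coderef2text( \&area );

print $regenerated, "\n";
print "comments survived: ",
    ( $regenerated =~ /metres/ ? "yes" : "no" ), "\n";
```

What comes back is generated from the compiled op tree, so the comments and original layout are not there to be recovered.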

Now, despite _all_ of these problems, the continuous insistence from the entire Perl community that writing a perl parser is impossible ("only perl can parse Perl"), and despite the fact I had seen several other people try and fail, or give up, or not really get started, I decided to redefine the problem slightly and have managed to get a working Perl parser up and running.

Or at least, well enough to handle selfgol and 90% of CPAN (I haven't really started working on corner cases yet, after which I expect to reach about 99%), and to do so needing ONLY what is contained in the .pm or .pl file and nothing else. The parser can read all of Acme::BadExample safely and write it back out again unchanged.

In any case, it works and works well enough to start building a number of cool toys on, such as normalisation and comparison of code, code metrics (Leon had a play with this), syntax highlighting, style analysis, the CPAN Cross Reference, and various other stuff that is staying on the whiteboard until API-freeze is finished. (back/forwardporting, refactoring, auto-documentation, smart perl diffs, "safe" testing of code, better dependency and version extraction, checkstyle, a refactoring perl editor ala IntelliJ IDEA, etc etc etc).

And most importantly, because it treats the Perl source as a document (data structure) and not as code (procedural execution) it can serialise back to source code which will be identical to what it read in. That is, it is totally round-trip safe (100% in testing of a CPAN subset of 5,500 perl files).

So $source -> $DocumentObject -> $source is safe.
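A minimal sketch of that round trip, assuming a current PPI release from CPAN is installed (the API was still being frozen at 0.831, so the method names below may not match that version exactly). It reuses the "unparsable on anything" example from earlier and picks out the "1" as a number without loading either module.

```perl
use strict;
use warnings;
use PPI;    # assumes a current PPI release from CPAN is installed

my $source = <<'END_SOURCE';
use Win32::Something;
use Proc::ProcessTable;
1;
END_SOURCE

# Build the document tree from the text alone; neither module is loaded.
my $document = PPI::Document->new( \$source )
    or die "PPI could not parse the source";

# find() returns an array reference, or a false value if nothing matched.
my $numbers = $document->find('PPI::Token::Number') || [];
print "number token: ", $_->content, "\n" for @$numbers;

# Serialise the tree back to text and compare it with the input.
print "round trip: ",
    ( $document->serialize eq $source ? "identical" : "DIFFERENT" ), "\n";
```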

Now of course, it is completely unable to deal with source filters. There is some talk of adding extensibility somehow, but that's an idea for another year (or decade).

Keeping the normal slight changes in grammar under control has been bloody hard, but dealing with arbitrary grammar manipulation would be just plain impossible.

But then I'm fairly comfortable in that source filters are considered scary and dangerous and everyone knows not to play with them unless you really need to, so it's fine to say "PPI does not support source filters".

Getting (finally) to perl6, I could have sworn I saw an RFC early on which said "Make perl6 easier to parse".

But it would appear the opposite is occurring. Source filters have become grammars and will now be officially approved and acceptable (yes?) while so far as I can tell the problem of prototype vs operator/operand interaction is not being addressed. (I'm a little in the dark here, perhaps it is and nobody has noticed enough)

What information I have managed to get from MAGNet #perl suggests that the "approved" way of manipulating code will be to parse it via a grammar into primitives, manipulate the primitives and then write it back out as those primitives.

Excuse my terminology here if I'm not using the exact terms you guys have been using.

But in any case, I take it the grammars are like Scheme "MACROS" (correct term?) or source filters and only work in one direction. That is, you can't take the result of whatever the grammar transformation is and reverse it back into the original code.

Suggesting to #perl that manipulating code this way and writing it back would completely destroy the code (certainly from a maintainability standpoint) seemed to just get shrugs from the audience.

I know it's probably a bit late at this point to be making huge changes (at the time the RFCs were being done I wasn't really confident enough in my knowledge to suggest anything), but I really would like to make the point that by going down this route of becoming less parsable we may well be sacrificing a huge range of potential analysis and manipulation technologies by making perl source even more impossible to document-parse than it already is.

Any comments or feedback you have on the issue of parsability would be welcome. I'm not sure if Damian is involved in Perl 6 language stuff any more, but if he or anyone else language-related is going to be at YAPC.AU next week I would dearly love to meet up and have a chat.

My currently-being-API-frozen perl parser is visible at

http://search.cpan.org/~adamk/PPI-0.831/

Thanks for your time

Adam Kennedy
