[[This is kind of a brain-dump and not completely organized, but I'm going
to send it.]]

The proposed "new-parser" is a lexer and parser with a specific role for
each: SYNTAX: words that execute arbitrary code should be replaced with
PARSER: words that only parse text, followed by a separate compile pass.

The main goals for the new-parser are:

1) allow the new-parser to parse files without compiling them

Since the lexer/parser must know all parsing words before encountering
them, or risk a bad parse, we have to choose between the following:

a) USE:/USING: forms are handled before other code
b) have -syntax.factor files that define PARSER:s and load them all and
force disambiguation
c) keep a metafile with a USING: list, like npm's package.json, which
pulls in modules before parsing.
d) something else!

2) to remember the parsed text to allow renaming/moving/deleting
words/vocabularies and other refactoring tools

3) track exact usage to allow perfect code reloading/renaming, even for
syntax that "goes away" under the current parser, such as octal literals

4) to avoid having to escape strings with backslashes, by using Lua
long-string syntax, which allows strings with arbitrary content to be
embedded inside any source file without ambiguity

a) this allows embedding DSLs with any syntax you want

5) allow for better docs/markdown syntax while still being 100% Factor
syntax, or allow registering different file endings so Factor knows how to
handle each file

Lexer algorithm

The lexer takes an entire stream (``utf8 file-contents`` for files) and
splits it into tokens, which are typed slices of the underlying stream
data. The parser sees each token and, if the token names a PARSER: word,
runs that word's parser to complete the parse.
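As a rough illustration of the token representation (sketched in Python
rather than Factor; all names here are hypothetical, not from the actual
implementation), a token can be a typed slice: a kind tag plus start/end
indices into the original text, so no substring is ever copied:

```python
# A rough sketch (Python, all names hypothetical) of tokens as typed
# slices: a kind tag plus bounds into the original text, no copying.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # e.g. "word", "comment", "string"
    start: int  # slice bounds into the underlying stream data
    end: int

    def text(self, source):
        return source[self.start:self.end]

def lex_words(source):
    """Minimal whitespace lexer: every token is a slice, never a copy."""
    tokens, i, n = [], 0, len(source)
    while i < n:
        while i < n and source[i].isspace():      # skip whitespace
            i += 1
        j = i
        while j < n and not source[j].isspace():  # scan one token
            j += 1
        if j > i:
            tokens.append(Token("word", i, j))
        i = j
    return tokens

src = "1 2 +"
print([t.text(src) for t in lex_words(src)])  # ['1', '2', '+']
```

Keeping slices instead of copies is also what makes goal 2 (remembering
the parsed text for refactoring tools) cheap.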

Tokens the lexer will recognize:

1) single line comments

! this is a comment
!this is a comment
append! ! this is the word append! and a comment
USING: ! the using list, comments are ok anywhere since the lexer knows
kernel math ;

restrictions:
a) words that start with a ``!`` are not allowed, but words ending in
``!`` or with ``!`` in the middle are fine, e.g. append! and map!reduce
are ok, while !append is a comment


2) typed strings

"regular string! must escape things \"quotes\" etc, but
can be multiline"
resource"core/math/math.factor" ! replaces "resource:core/math/math.factor"
vocab"math" ! replaces "vocab:math"
url"google.com" ! instead of URL" google.com"
sbuf"my string buffer"

restrictions:
c) word names can't contain a ``"``; they will parse as typed strings instead


3) typed array-likes (compile-time)

{ 1 2 }
{ 1 2 3 4 + } ! becomes { 1 2 7 } at compile-time
suffix-array{ 1 2 3 } ! suffix array literal
V{ } ! vector literal
H{ { 1 2 } }
simplify{ "x" 0 * 1 + 3 * 12 + } ! from
http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
d) words that end in { parse until the matching } using lexer tokenization
rules


4) typed quotation-likes (run-time)

[ ] ! regular quotation
[ { 1 2 3 + } { 4 5 } v+ ] ! [ { 1 5 } { 4 5 } v+ ] at compile-time
{ { 1 2 3 + } { 4 5 } v+ } ! { 5 10 } at compile-time
H{ { 1 2 } { 2 3 4 + } [ 5 + ] } ! H{ { 1 2 } { 2 7 } [ 5 + ] } at compile-time
simplify[ "x" 0 * 1 + 3 * 12 + ] ! from
http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
e) words that end in [ parse until the matching ] using lexer tokenization
rules
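The matching-bracket rule in d) and e) can be sketched like this (Python,
illustrative only, not the actual implementation): scan tokens with the
normal lexer tokenization, counting opener and closer tokens, until the
literal's own closer is reached:

```python
# Rough sketch (Python, illustrative only): collect tokens until the
# matching close bracket, counting nesting by lexer tokens, so nested
# literals like [ 1 + ] inside simplify[ ... ] are handled.
def parse_until_matching(tokens, i, opener="[", closer="]"):
    """tokens[i-1] opened a literal; return (body, next_index).
    Raises IndexError if the literal is unterminated."""
    depth, body = 1, []
    while depth > 0:
        tok = tokens[i]
        if tok.endswith(opener):   # any foo[ or bare [ nests deeper
            depth += 1
        elif tok == closer:
            depth -= 1
        if depth > 0:
            body.append(tok)
        i += 1
    return body, i

toks = 'simplify[ "x" 0 * [ 1 + ] 3 * ] rest'.split()
body, nxt = parse_until_matching(toks, 1)
print(body)  # ['"x"', '0', '*', '[', '1', '+', ']', '3', '*']
print(toks[nxt])  # rest
```

The same routine with ``{``/``}`` covers the compile-time array-likes.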


5) typed stack annotation word

( a b c -- d ) ! regular stack effect

( a b c ) ! input stack effect, lexical variable assignment

1 2 3 :> ( a b c ) ! current multiple assignment follows the rule

shuffle( a b -- b a ) ! current shuffle word follows this

FUNCTION: int getcwd ( char *buf, size_t size ) ; ! follows the rule

restrictions:
words that end in ( must parse until the matching ) using lexer tokenization rules


6) typed long-strings

[[long string]]

[[This string doesn't need "escapes"\n and is a single line since the
newline is just a "backslash n".]]

[=[embed the long string "[[long string]]"]=]

[==[embed the previous string: [=[embed the long string "[[long string]]"]=]]==]

! The current EBNF: syntax still works, but you can also have
! arbitrary EBNF literals:

CONSTANT: simple-tokenizer-ebnf-literal EBNF[=[
space = [ \t\n\r]
escaped-char = "\\" .:ch => [[ ch ]]
quoted = '"' (escaped-char | [^"])*:a '"' => [[ a ]]
unquoted = (escaped-char | [^ \t\n\r"])+
argument = (quoted | unquoted) => [[ >string ]]
command = space* (argument:a space* => [[ a ]])+:c !(.) => [[ c ]]
]=]

CONSTANT: hello-world c-program[====[
#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("hello\n");
    printf("oh noes, the closing ]] ]=] ]==] ]===]\n");
    return 0;
}
]====]


restrictions:
words that have the following tokens anywhere will parse as long strings:
[= {= [[ {{

- ``[=`` throws an error unless it is followed only by more ``=``
characters and then ``[``, e.g. ``[======[`` is ok, ``[=====   [`` is an error
- ``[===[`` parses until ``]===]`` or throws an error
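The long-bracket rule is the same trick Lua uses: the opener ``[=*[``
fixes the exact closer ``]=*]``, so nothing inside ever needs escaping.
A rough sketch (Python, illustrative only, not the actual lexer):

```python
# Rough sketch (Python, illustrative only) of Lua-style long brackets:
# the number of = signs in the opener determines the one closer to scan for.
import re

def read_long_string(source, pos):
    """source[pos:] must start with '[' '='* '['; return (body, end_pos).
    Raises ValueError on a malformed opener or a missing closer."""
    m = re.match(r"\[(=*)\[", source[pos:])
    if not m:
        raise ValueError("malformed long-string opener")
    closer = "]" + m.group(1) + "]"        # e.g. ]===] for [===[
    body_start = pos + m.end()
    close_at = source.find(closer, body_start)
    if close_at == -1:
        raise ValueError("unterminated long string")
    return source[body_start:close_at], close_at + len(closer)

body, end = read_long_string('[=[embed "[[long string]]"]=]', 0)
print(body)  # embed "[[long string]]"
```

Embedding a level-N string only requires a level-N+1 opener, which is why
the c-program[====[ ... ]====] example above can contain every shorter closer.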


To sum the lexer up:
! starts a comment except within a word
foo" starts a typed foo string
foo{ starts a typed compile-time literal
foo[ starts a typed run-time literal
foo( starts a typed stack annotation word
foo{{ starts a typed compile-time string
foo[[ foo[=[ starts a typed run-time string
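The summary above amounts to classifying a raw token by its first or last
characters. A rough sketch (Python, illustrative only; the kind names are
made up for the example):

```python
# Rough sketch (Python, illustrative only) of the dispatch table above:
# classify a raw token by its leading ! or its trailing delimiter.
import re

def classify(token):
    if token.startswith("!"):
        return "comment"               # !foo is a comment; append! is a word
    if re.search(r"\[=*\[$", token):
        return "run-time-string"       # foo[[ foo[=[ etc.
    if re.search(r"\{=*\{$", token):
        return "compile-time-string"   # foo{{ etc.
    if token.endswith('"'):
        return "typed-string"          # foo"..."
    if token.endswith("{"):
        return "compile-time-literal"  # foo{ ... }
    if token.endswith("["):
        return "run-time-literal"      # foo[ ... ]
    if token.endswith("("):
        return "stack-annotation"      # foo( ... )
    return "word"

for t in ["!comment", "append!", 'url"', "V{", "simplify[", "shuffle(", "EBNF[=["]:
    print(t, "->", classify(t))
```

Note the string cases must be checked before the single-bracket cases,
since ``foo[[`` also ends in ``[``.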

I want to add multiline comments; I'm not sure what the syntax would be,
but I'm leaning toward ![[ ![===[ etc. so you don't have to deal with
C-style embedded comments, ML-style matched comments (* *) that can't
contain arbitrary text, and so on.

The goal of the long-string is to not have to think about quoting strings
or nesting comments, which has wasted thousands (millions?) of programmer
hours and caused countless bugs and so much frustration. Triple-quoted
strings just hide the problem for a while, but eventually you will need to
escape something even in a triple-quoted string, e.g. docs about
triple-quoted strings, or pages and pages of code you want to just
copy/paste into a literal in the repl. Lua recognizes this, but I am not a
Lua programmer so I don't know the extent to which this solves problems
for people. I think it works even better in Factor, where you can add
types to strings for DSLs, module system file handlers, Factor docs
syntax, etc.


If the libertarian/anarchist/free-spirit in you feels troubled by all the
naming restrictions that THE MAN is forcing on you, and you really want to
name a word ``key-[`` or have ``funky-literals{ ]``, then we could think about
adding lexing words which override the rules laid out above. The lexer
would see the start of a lexer rule but check if you have overridden it and
act accordingly.

Module system idea:

Handling -docs.factor, -tests.factor:
If we had a docs[[ ]] form and a tests[[ ]] form, and you could register
certain file endings/extensions with these parsers, then adding different
kinds of files could be automated and simplified.
foo/bar/bar.factor -- loads to a factor[[ ]], an arbitrary factor code
literal
foo/bar/bar-syntax.factor -- loads to a syntax[[]]
foo/bar/bar-docs.factor -- loads with a docs[[]]
foo/bar/bar-tests.factor -- loads to a tests[[]]
bar.c -- is really just a C file, loads into a c-file[[ ]] and factor
compiles it or forwards it to clang or whatever you want

Since you can nest the long-strings arbitrarily, to handle them you just
strip off the delimiters and parse the inside as whatever you want, and you
can even invoke the Factor parser again. Docs could be like this:

! DOCS EXAMPLE in -docs.factor file

$article[=[$link[[interned-words]] $title[[Looking up and creating words]]
A word is said to be $emphasis[[interned]] if it is a member of the
vocabulary named by its vocabulary slot. Otherwise, the word is
$emphasis[[uninterned]].

Parsing words add definitions to the current vocabulary. When a source file
is being parsed, the current vocabulary is initially set to
$vocab-link[[scratchpad]]. The current vocabulary may be changed with the
$link[[IN:]] parsing word (see $link[[word-search]]).
$subsections[
    create-word
    create-word-in
    lookup-word
]
]=]

Or you could just register a markdown[[ ]] handler for .md/.markdown files
and write a markdown-to-Factor-docs compiler, or compile Factor docs to
.markdown, etc. The current docs could be converted mechanically.
Suggestions?



More thoughts, unimplemented, open for discussion (as everything is!):

If we wanted to have no PARSER: words at all, we could have another rule:
CAPITAL: words parse until ;
lowercase: take-one-token
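That two-line rule can be sketched like this (Python, illustrative only;
the real parser would of course work on lexer tokens, not a naive split):

```python
# Rough sketch (Python, illustrative only) of the proposed rule:
# CAPITAL: defining words scan until ';', lowercase: words take one token.
def parse_form(tokens, i):
    """tokens[i] is a defining word ending in ':'.
    Returns (the tokens making up the form, next index)."""
    word = tokens[i]
    assert word.endswith(":")
    if word[:-1].isupper():
        j = tokens.index(";", i + 1)     # CAPITAL: parses until ;
        return tokens[i:j + 1], j + 1
    return tokens[i:i + 2], i + 2        # lowercase: takes one token

toks = "SYMBOLS: a b c ; use: math".split()
form, nxt = parse_form(toks, 0)
print(form)  # ['SYMBOLS:', 'a', 'b', 'c', ';']
form, nxt = parse_form(toks, nxt)
print(form)  # ['use:', 'math']
```

A reader can predict the extent of every form without knowing any word's
semantics, which is the whole point.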

This almost works perfectly with the current system, except for words like
GENERIC: GENERIC# etc. However, Slava laid out plans to remove such words
in a blog post from 2008. FUNCTION: currently doesn't have a trailing
semicolon, but one could be added back.

http://factor-language.blogspot.com/2008/06/syntax-proposal-for-multi-methods.html

The advantage of having such a regular syntax is that even non-programmers
can look at the code and see exactly how it parses, using conventions
familiar from English: matching parentheses, upper and lower case, and
scanning ahead until a terminator, much like sentences work.


Some new-parser words (extra/modern/factor/factor.factor):

lexer primitives:
token - get a lexer token
parse - get a lexer token and run it as a parsing word (if it is one)
raw - bypass the lexer, get a token until whitespace
";" raw-until - call ``raw`` until you hit a ";"
";" parse-until - call ``parse`` until you hit a ";"
";EBNF" multiline-string-until - take chars until you hit ;EBNF

lexer syntax sugar, shortcuts:
body - syntax sugar for ``";" parse-until``
new-word - syntax sugar for ``token``, but tags it as a new word
new-class - syntax sugar for ``token``, but tags it as a new class
existing-word - same deal
existing-class - same deal

examples:
QPARSER: qparser QPARSER: raw raw body ;
QPARSER: function : new-word parse body ;
QPARSER: c-function FUNCTION: token new-word parse ;
QPARSER: memo MEMO: new-word parse body ;
QPARSER: postpone POSTPONE: raw ;
QPARSER: symbols SYMBOLS: ";" raw-until ;
QPARSER: char CHAR: raw ;
QPARSER: constant CONSTANT: new-word parse ;
QPARSER: morse [MORSE "MORSE]" multiline-string-until ;

The foo[ foo( foo{ etc. forms don't really need QPARSER: definitions. (Q
stands for "quick", as this is the quicker iteration of the parser
compared to the previous implementation ;)

Tools:

I have a tool in another branch that can rename the comment character from
! to whatever you want and rewrite all the Factor files.

The new-parser can parse 4249 source, docs, and tests files in 2.5 seconds
before any extra optimizations, and I'm sure there's potential for more.


Compilation:

Compilation will happen in a few passes:

1) parse everything into a sequence of definitions
2) iterate the definitions and define new class/word symbols
3) take into account the USING: list and resolve all classes/words and
numbers; anything that is not one of these will throw an error
4) output a top-level quotation that compiles all the words at once, where
each word builds its own quotation that ends in ``define``,
``define-generic``, etc.
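Passes 2 and 3 can be sketched like this (Python, illustrative only):
defining every symbol before resolving any body is exactly what makes the
circular definitions mentioned below work:

```python
# Rough sketch (Python, illustrative only) of compile passes 2 and 3:
# create a symbol per definition first, then resolve bodies, so
# mutually recursive (circular) definitions resolve fine.
def compile_defs(defs):
    """defs maps a word name to its parsed but unresolved token list."""
    symbols = {name: object() for name in defs}   # pass 2: define symbols
    resolved = {}
    for name, body in defs.items():               # pass 3: resolve tokens
        out = []
        for tok in body:
            if tok in symbols:
                out.append(symbols[tok])          # reference to a word
            elif tok.lstrip("-").isdigit():
                out.append(int(tok))              # number literal
            else:
                raise NameError("unresolved token: " + tok)
        resolved[name] = out
    return resolved

defs = {"even?": ["1", "odd?"], "odd?": ["1", "even?"]}  # circular, still ok
print(compile_defs(defs)["even?"][0])  # 1
```

Pass 4 would then emit one top-level quotation that runs all the
``define`` calls at once.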


Other random ideas and consequences:

- can allow circularly dependent vocabularies
- circular definitions
- can remove DEFER:, << >>, POSTPONE: (replace with ``\``), maybe others
- symbols can only be either a class or a word but not both, which is
almost the case now (around five words are still both)
- IN: can go away and be based on the filename; IN-UNSAFE: can replace IN:
for when it doesn't match the filename
- possibly ALL code could be in scope, or scope by USE: core, USE: basis,
etc, and disambiguate as needed
- need to reexamine Joe's module system proposal for ideas
https://github.com/slavapestov/factor/issues/641
https://gist.github.com/jckarter/3440892


Road map:
I need a few days of large blocks of uninterrupted time to get things to
compile and reload correctly. The compiler invoker (not the actual
compiler), refactoring tools, and the source writer need to be fixed up.
The walker tool needs to be rewritten, but it can handle unexpanded
parsing words, local variables, etc.
Help is welcome!


The code so far:

The parser works as described, but without long-string comments. The other
vocabs are kind of half-baked, but I have written them a couple of times
to varying degrees of completeness. The current parser can parse the
entire Factor codebase without erroring. It may still have some problems
writing files back, but those can be ironed out because I did it once
before.

git remote add erg https://github.com/erg/factor.git
git fetch erg
git checkout modern4
code is in extra/modern/

"modern" load
all-factor-files [ dup quick-parse-path ] { } map>assoc

"1 2 3" qparse >out .
"math" qparse-vocab
"math" qparse-vocab.


Let me know what you think about any of this!

Doug

On Fri, Aug 7, 2015 at 2:52 PM, Jon Harper <jon.harpe...@gmail.com> wrote:

> Hi Doug,
> so I guess everyone has been teased with all the clues about the new
> parser :)
> 1fcf96cada0737 says "something else soon.",
> https://github.com/slavapestov/factor/issues/1398 mentions it, etc.
>
> Could you share your plans for the new parser ? How will it be different,
> what will it improve, etc ?
>
> Thanks,
> Jon
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk
>
>