Re: [Factor-talk] new parser

2015-08-07 Thread Doug Coleman
I just now added the ![[ ]] comment syntax and fixed !comments.

Force-pushed to the erg/modern4 branch.

On Fri, Aug 7, 2015 at 5:27 PM, Doug Coleman wrote:


Re: [Factor-talk] new parser

2015-08-07 Thread Doug Coleman
[[This is kind of a brain-dump and not completely organized, but I'm going
to send it.]]

The proposed "new-parser" is a lexer and parser with specific roles for
each. SYNTAX: words that execute arbitrary code should be replaced with
PARSER: words that only parse text, and a compile pass.

The main goals for the new-parser are:

1) allow the new-parser to parse files without compiling them

Since the lexer/parser must know all parsing words before encountering
them, or risk a bad parse, we have to choose between the following:

a) USE:/USING: forms are handled before other code
b) have -syntax.factor files that define PARSER:s and load them all and
force disambiguation
c) keep a metafile with a USING: list, like npm's package.json that pulls
in modules before parsing.
d) something else!
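
Option a) can be sketched as a pre-scan over the token stream. This is only a rough illustration in Python, not the new-parser's code: ``collect_vocabs`` is a hypothetical helper, and it assumes the tokens are already whitespace-split with comments removed.

```python
def collect_vocabs(tokens):
    """Return the vocabulary names named by USE: and USING: forms, in order."""
    vocabs = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "USE:" and i + 1 < len(tokens):
            vocabs.append(tokens[i + 1])      # USE: takes exactly one name
            i += 2
        elif tokens[i] == "USING:":
            i += 1
            while i < len(tokens) and tokens[i] != ";":
                vocabs.append(tokens[i])      # every token up to ; is a name
                i += 1
            i += 1                            # skip the closing ;
        else:
            i += 1                            # ordinary code, ignored here
    return vocabs
```

Everything outside the USE:/USING: forms is skipped, so the list of vocabularies is known before any other token has to be interpreted.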

2) to remember the parsed text to allow renaming/moving/deleting
words/vocabularies and other refactoring tools

3) exact usage to allow perfect code reloading/renaming, even for syntax
that "goes away", such as octal literals, with the current parser

4) to avoid having to use backslashes to escape strings by using Lua
long-string syntax which allows strings with arbitrary content to be
embedded inside any source file without ambiguity

a) this allows embedding DSLs with any syntax you want

5) allow for better docs/markdown syntax while still being 100% Factor
syntax, or allow registering different file endings so Factor knows how to
handle each file

Lexer algorithm

The lexer takes an entire stream (``utf8 file-contents`` for files) and
parses it into tokens, which are typed slices of the underlying stream
data. The parser sees each token and if the token is a PARSER: then it runs
that token's parser to complete the parse.
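
To make "typed slices" concrete, here is a minimal Python sketch. The ``Token`` class, the ``"word"`` kind label and both helper functions are assumptions for illustration only. The point is that tokens hold offsets into the source rather than copies, so a rename can rewrite just the matching spans.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str    # illustrative label, e.g. "word", "comment", "string"
    start: int   # offset of the first character in the source
    end: int     # offset one past the last character

    def text(self, source):
        return source[self.start:self.end]

def lex_words(source):
    """Split on whitespace only, recording offsets instead of copying text."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        if source[i].isspace():
            i += 1
            continue
        j = i
        while j < n and not source[j].isspace():
            j += 1
        tokens.append(Token("word", i, j))
        i = j
    return tokens

def rename(source, tokens, old, new):
    """Rewrite every token spelled `old`; all other characters stay untouched."""
    out, pos = [], 0
    for t in tokens:
        if t.text(source) == old:
            out.append(source[pos:t.start])
            out.append(new)
            pos = t.end
    out.append(source[pos:])
    return "".join(out)
```

Because whitespace and layout are never re-generated, the rewritten file differs from the original only where a token was replaced.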

``tokens`` the lexer will recognize:

1) single line comments

! this is a comment
!this is a comment
append! ! this is the word append! and a comment
USING: ! the using list, comments are ok anywhere since the lexer knows
kernel math ;

restrictions:
a) words that start with a ``!`` are not allowed, but words ending with !
or with ! in the middle are fine, e.g. append! map!reduce are ok, !append
is a comment
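
The comment rule could look roughly like this in Python. ``split_comment`` is a hypothetical name, and the sketch works on a single line and ignores strings, so a ``!`` at the start of a token inside a string would be misread.

```python
def split_comment(line):
    """Return (code, comment); comment is None when the line has none."""
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
            continue
        if line[i] == "!":
            # A token that starts with ! turns the rest of the line into a comment.
            return line[:i].rstrip(), line[i:]
        while i < n and not line[i].isspace():
            i += 1   # skip the rest of the word, including any ! inside it
    return line.rstrip(), None
```

Only the first character of each token is checked, which is why append! and map!reduce stay words while !append becomes a comment.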


2) typed strings

"regular string! must escape things \"quotes\" etc, but
can be multiline"
resource"core/math/math.factor" ! replaces "resource:core/math/math.factor"
vocab"math" ! replaces "vocab:math"
url"google.com" ! instead of URL" google.com"
sbuf"my string buffer"

restrictions:
c) can't have a " in word names, they will parse as typed strings instead
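
A typed string can be recognized with one regular expression. The Python below is a sketch under assumptions: the names are made up, the type prefix is taken to be any run of non-whitespace, non-quote characters directly before the opening quote, and escapes are left undecoded.

```python
import re

# prefix, opening quote, body with backslash escapes, closing quote
TYPED_STRING = re.compile(r'([^\s"]*)"((?:[^"\\]|\\.)*)"', re.DOTALL)

def lex_typed_string(source, pos=0):
    """Match a typed string at pos; return (prefix, raw_body, end) or None."""
    m = TYPED_STRING.match(source, pos)
    if m is None:
        return None
    return m.group(1), m.group(2), m.end()
```

An empty prefix is the regular string, and ``re.DOTALL`` lets an escaped character be a newline; unescaped newlines in the body already match, so multiline strings work.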


3) typed array-likes (compile-time)

{ 1 2 }
{ 1 2 3 4 + } ! becomes { 1 2 7 } at compile-time
suffix-array{ 1 2 3 } ! suffix array literal
V{ } ! vector literal
H{ { 1 2 } }
simplify{ "x" 0 * 1 + 3 * 12 + } ! from
http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
d) words that end in { parse until the matching } using lexer tokenization
rules


4) typed quotation-likes (run-time)

[ ] ! regular quotation
[ { 1 2 3 + } { 4 5 } v+ ] ! [ { 1 5 } { 4 5 } v+ ] at compile-time
{ { 1 2 3 + } { 4 5 } v+ } ! { 5 10 } at compile-time
H{ { 1 2 } { 2 3 4 + } [ 5 + ] } ! H{ { 1 2 } { 2 7 } { 5 + } } at
compile-time
simplify[ "x" 0 * 1 + 3 * 12 + ] ! from
http://re-factor.blogspot.com/2015/08/automated-reasoning.html

restrictions:
e) words that end in [ parse until the matching ] using lexer tokenization
rules
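
Restrictions d) and e) amount to bracket matching over tokens. A rough Python sketch, assuming whitespace-split tokens and a made-up tree shape of (opener, children) tuples:

```python
CLOSERS = {"{": "}", "[": "]"}

def parse_nested(tokens):
    """Turn a flat token list into nested (opener, children) tuples."""
    root = []
    stack = [(None, root)]            # (expected closer, list being filled)
    for tok in tokens:
        if tok[-1] in CLOSERS:        # V{  H{  simplify[  or a bare { or [
            children = []
            stack[-1][1].append((tok, children))
            stack.append((CLOSERS[tok[-1]], children))
        elif tok in CLOSERS.values():
            expected, _ = stack.pop()
            if tok != expected:       # also catches a closer with no opener
                raise SyntaxError("expected %r, got %r" % (expected, tok))
        else:
            stack[-1][1].append(tok)
    if len(stack) != 1:
        raise SyntaxError("missing %r" % stack[-1][0])
    return root
```

The opener token is kept as the node's type, so ``H{`` and ``simplify[`` can later be handed to whatever compile pass knows about them.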


5) typed stack annotation word

( a b c -- d ) ! regular stack effect

( a b c ) ! input stack effect, lexical variable assignment

1 2 3 :> ( a b c ) ! current multiple assignment follows the rule

shuffle( a b -- b a ) ! current shuffle word follows this

FUNCTION: int getcwd ( char *buf, size_t size ) ; ! follows the rule

restrictions:
words that end in ( must parse until the matching ) using lexer
tokenization rules
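
As an illustration of the ( rule, the Python sketch below splits one effect into inputs and outputs. ``parse_effect`` is a hypothetical helper; it expects the tokens of a single form, from the word ending in ( through the closing ), and does not handle nested effects.

```python
def parse_effect(tokens):
    """Parse [opener, names..., ")"] into (opener, inputs, outputs).

    outputs is None when there is no "--", as in the ( a b c ) form.
    """
    opener, body, closer = tokens[0], tokens[1:-1], tokens[-1]
    if not opener.endswith("(") or closer != ")":
        raise SyntaxError("effect must start with a word ending in ( and end with )")
    if "--" in body:
        split = body.index("--")
        return opener, body[:split], body[split + 1:]
    return opener, body, None
```

Keeping the opener in the result is what makes the annotation "typed": ``(`` and ``shuffle(`` share one parse and differ only in who consumes it.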


6) typed long-strings

[[long string]]

[[This string doesn't need "escapes"\n and is a single line since the
newline is just a "backslash n".]]

[=[embed the long string "[[long string]]"]=]

[==[embed the previous string: [=[embed the long string "[[long
string]]"]=]]==]

! The current EBNF: syntax still works, but you can also have arbitrary
EBNF literals

CONSTANT: simple-tokenizer-ebnf-literal EBNF[=[
space = [ \t\n\r]
escaped-char = "\\" .:ch => [[ ch ]]
quoted = '"' (escaped-char | [^"])*:a '"' => [[ a ]]
unquoted = (escaped-char | [^ \t\n\r"])+
argument = (quoted | unquoted) => [[ >string ]]
command = space* (argument:a space* => [[ a ]])+:c !(.) => [[ c ]]
]=]

CONSTANT: hello-world c-program[[
#include <stdio.h>

int main(int argc, char *argv[]) {
printf("hello\n");
printf("oh noes, the closing ]] ]=] ]==] ]===]\n");
return 0;
}
]]


restrictions:
words that have the following tokens anywhere will parse as long strings:
[= {= [[ {{

- after ``[=``, the lexer throws an error unless it finds only more ``=``
characters followed by ``[``, e.g. ``[==[`` is ok, ``[=   [`` is an error
- ``[===[`` parses until ``]===]`` or throws an error
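
The long-string rules follow Lua's long brackets and can be sketched in a few lines of Python. The names are hypothetical, the sketch starts at the opening bracket (a type prefix such as ``EBNF`` would be lexed separately), and only the ``[`` forms are handled, not ``{=`` or ``{{``.

```python
import re

LONG_OPEN = re.compile(r"\[(=*)\[")   # [[  [=[  [==[  ...

def lex_long_string(source, pos=0):
    """Lex a long string starting at pos; return (body, end_offset)."""
    m = LONG_OPEN.match(source, pos)
    if m is None:
        raise SyntaxError("expected [[ or [=...=[ at offset %d" % pos)
    closer = "]" + m.group(1) + "]"    # same number of = as the opener
    end = source.find(closer, m.end())
    if end < 0:
        raise SyntaxError("missing %s" % closer)
    return source[m.end():end], end + len(closer)
```

The body is taken verbatim, with no escape processing, so any closing bracket with a different number of ``=`` signs is just ordinary text inside it.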


To sum the lexer up:
! starts a comment except within a word
foo" starts a typed foo s

[Factor-talk] new parser

2015-08-07 Thread Jon Harper
Hi Doug,
so I guess everyone has been teased with all the clues about the new parser
:)
1fcf96cada0737 says "something else soon.",
https://github.com/slavapestov/factor/issues/1398 mentions it, etc.

Could you share your plans for the new parser? How will it be different,
what will it improve, etc.?

Thanks,
Jon
--
___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk