I started a JS parser today (named "parsen" :-) ).
I don't have a GitHub account yet, so I've attached the code
("parsen.main.cpp").
The goal is a program that can parse its own code (the JS
output from emscripten).
For the moment I only have a lexer. It splits the input into new lines
("\n"), spaces (blanks), words ("alnum") and brackets (open/close:
round, curly, square).
Have a look at "bool parse(const char* path)" in the source file.
I want to provide the JS AST as a flat indexed array (a parent/child
table).
Each cell should have:
-its type (a friendly C++11 "enum class",
"function","var","+" ...)
-the index of its parent (in the array)
-line/column number in the source file
-its name/content or value (floats ...)
-...
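A rough sketch of such a table, just to fix ideas. None of these names
(cell, kind, add, depth) exist in parsen.main.cpp yet; they are
placeholders, and the list of kinds is far from complete:

```cpp
#include <string>
#include <vector>

// Hypothetical node kinds; the real list would be much longer.
enum class kind { program, function, var, plus, number, name };

// One cell of the flat parent/child table.
struct cell
{
    kind type;          // what the node is
    int parent;         // index of the parent cell, -1 for the root
    int line;           // position in the source file
    int column;
    std::string text;   // name, content or literal value
};

// Append a cell and return its index so that children can refer to it.
int add(std::vector<cell>& ast, kind type, int parent,
        int line, int column, const std::string& text)
{
    ast.push_back(cell{type, parent, line, column, text});
    return (int)ast.size() - 1;
}

// Walk up the parent links and count the steps to the root.
int depth(const std::vector<cell>& ast, int index)
{
    int result = 0;
    while (ast[index].parent >= 0)
    {
        index = ast[index].parent;
        result++;
    }
    return result;
}
```

Because a child only stores the index of its parent, the whole tree
lives in one contiguous array and can be scanned with a plain loop.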
I can also decode the special instructions of asm.js ("|0" ...).
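A minimal sketch of that decoding, assuming the lexer hands over the
operator and its right operand as strings. The names (coercion,
classify) are hypothetical; the three forms handled are the usual
asm.js coercions x|0 (signed int), x>>>0 (unsigned int) and +x
(double):

```cpp
#include <string>

// Hypothetical result type for this sketch.
enum class coercion { none, signed_int, unsigned_int, floating };

// Classify "<expr> <op> <right>" for the binary forms, or "<op><expr>"
// for the unary plus (pass an empty `right` in that case).
coercion classify(const std::string& op, const std::string& right)
{
    if (op == "|" && right == "0")
        return coercion::signed_int;    // x|0
    if (op == ">>>" && right == "0")
        return coercion::unsigned_int;  // x>>>0
    if (op == "+" && right.empty())
        return coercion::floating;      // +x
    return coercion::none;
}
```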
The end goal is to be able to run queries over the AST (like LINQ),
something like:
ast.select("function").select("var").dump();
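Here is a sketch of how such chained queries could work on the flat
table. It is only an assumption about the semantics: I take
select("var") after select("function") to mean "every var cell that has
a selected function among its ancestors". The names (cell, query, from)
are hypothetical, and the type is a plain string to keep it short:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical cell of the flat table.
struct cell
{
    std::string type;
    int parent;       // index of the parent cell, -1 for the root
    std::string text;
};

// A query is a set of selected cell indices over one table.
struct query
{
    const std::vector<cell>* ast;
    std::vector<int> picked;

    // True if one of the ancestors of `index` is in the current selection.
    bool below_picked(int index) const
    {
        for (int p = (*ast)[index].parent; p >= 0; p = (*ast)[p].parent)
            for (int candidate : picked)
                if (candidate == p)
                    return true;
        return false;
    }

    // Keep the cells of the given type located under the current selection.
    query select(const std::string& type) const
    {
        query result{ast, {}};
        for (int i = 0; i < (int)ast->size(); i++)
            if ((*ast)[i].type == type && below_picked(i))
                result.picked.push_back(i);
        return result;
    }

    // Print one line per selected cell.
    void dump() const
    {
        for (int i : picked)
            std::printf("#%d %s \"%s\"\n", i,
                        (*ast)[i].type.c_str(), (*ast)[i].text.c_str());
    }
};

// Start from the root cell (index 0) so the first select() sees the whole tree.
query from(const std::vector<cell>& ast)
{
    return query{&ast, {0}};
}
```

With this, from(ast).select("function").select("var").dump() would list
the vars declared inside functions and skip the global ones.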
Can you tell me more about your needs? What kind of patterns does the
optimizer need to find in the AST?
PS: Sorry for the many mistakes. I'm French and my English is not good
at all!
On Sun, 16 Nov 2014 11:02:45 -0800,
Alon Zakai <[email protected]> wrote:
> The goal is to parse the JS output of the fastcomp LLVM backend. Then
> we run optimization passes on that AST.
>
> Thanks about TinyJS, looks interesting! Ok, at this point I am
> considering 3 options:
>
> 1. Modify TinyJS parser (already in C++, which is good)
> 2. Port Higgs parser from D (nicest written code of all the options)
> 3. Port Acorn parser from JS
>
> I am leaning to the last, because it seems the most active and
> maintained, and has support for parsing ES6 already (we don't need
> that immediately, but eventually we might). Also it is the only one
> that has focused on parsing speed, as far as I can tell.
>
> - Alon
>
>
>
> On Fri, Nov 14, 2014 at 7:44 PM, Marc <[email protected]> wrote:
>
> > This one is not bad:
> > https://code.google.com/p/tiny-js/source/browse/trunk/TinyJS.h
> >
> > There are only two files to include.
> >
> > The licence is OK (MIT-like).
> >
> > Which part of the JS files do you want to parse? Is it the generated
> > "LLVM as JS" output or one of the libraries you've written (like
> > "parseTools.js" or "analyzer.js")?
> >
> > I've looked a bit at ANTLR, but the grammar files for JavaScript
> > are old.
> >
> > There is a more "exotic" alternative I can imagine. It is to use
> > this Haskell parser:
> >
> > https://hackage.haskell.org/package/language-javascript
> >
> > The grammar file is really pretty:
> >
> >
> > https://github.com/alanz/language-javascript/blob/master/src/Language/JavaScript/Parser/Grammar5.y
> >
> > I know that GHC generates a kind of C ("C--") as intermediate
> > code. It may be possible to wrap a function around
> > it.
> >
> > It's a crazy idea :-)
> >
> >
> >
> > On Fri, 14 Nov 2014 16:43:55 -0800,
> > Alon Zakai <[email protected]> wrote:
> >
> > > I wasn't familiar with that, thanks. Looks interesting, however
> > > the GPL license is a problem as we do want the option to run the
> > > parser on the client machine, linked to other code, and this
> > > would limit the amount of people that would use it.
> > >
> > > - Alon
> > >
> > >
> > > On Fri, Nov 14, 2014 at 3:04 AM, Marc <[email protected]> wrote:
> > >
> > > > Do you know this one?
> > > > https://github.com/cesanta/v7
> > > >
> > > > On Thu, 13 Nov 2014 17:19:46 -0800,
> > > > Alon Zakai <[email protected]> wrote:
> > > >
> > > > > Early this year the fastcomp project replaced the core
> > > > > compiler, which was written in JS, with an LLVM backend in
> > > > > C++, and that brought large compilation speedups. However,
> > > > > the late JS optimization passes were still run in JS, which
> > > > > meant optimized builds could be slow (in unoptimized builds,
> > > > > we don't run those JS optimizations, typically). Especially
> > > > > in very large projects, this could be annoying.
> > > > >
> > > > > Progress towards speeding up those JS optimization passes just
> > > > > landed, turned off, on incoming. This is not yet stable or
> > > > > ready, so it is *not* enabled by default. Feel free to test
> > > > > it though and report bugs. To use it, build with
> > > > >
> > > > > EMCC_NATIVE_OPTIMIZER=1
> > > > >
> > > > > in the environment, e.g.
> > > > >
> > > > > EMCC_NATIVE_OPTIMIZER=1 emcc -O2 tests/hello_world.c
> > > > >
> > > > > It just matters when building to JS (not building C++ to
> > > > > object/bitcode). When EMCC_DEBUG=1 is used, you should see it
> > > > > mention it uses the native optimizer. The first time you use
> > > > > it, it will also say it is compiling it, which can take
> > > > > several seconds.
> > > > >
> > > > > The native optimizer is basically a port of the JS optimizer
> > > > > passes from JS into c++11. c++11 features like lambdas made
> > > > > this much easier than it would have been otherwise, as the JS
> > > > > code has lots of lambdas. The ported code uses the same
> > > > > JSON-based AST, implemented in C++.
> > > > >
> > > > > Using c++11 is a little risky. We build the code natively,
> > > > > using clang from fastcomp, but we do use the system C++
> > > > > standard libraries. In principle if those are not
> > > > > c++11-friendly, problems could happen. It seems to work fine
> > > > > where I tested so far.
> > > > >
> > > > > Not all passes have been converted, but the main
> > > > > time-consuming passes in -O2 have been (eliminator,
> > > > > simplifyExpressions, registerize). (Note that in -O3 the
> > > > > registerizeHarder pass has *not* yet been converted.) The
> > > > > toolchain can handle running some passes in JS and some
> > > > > passes natively, using JSON to serialize them.
> > > > >
> > > > > Potentially this approach can speed us up very significantly,
> > > > > but it isn't quite there yet. JSON parsing/unparsing and
> > > > > running the passes themselves can be done natively, and in
> > > > > tests I see that running 4x faster, and using about half as
> > > > > much memory. However, there is overhead from serializing JSON
> > > > > between native and JS, which will remain until 100% of the
> > > > > passes you use are native. Also, and more significantly, we
> > > > > do not have a parser from JS - the output of fastcomp - to
> > > > > the JSON AST. That means that we send fastcomp output into JS
> > > > > to be parsed, it emits JSON, and we read that in the native
> > > > > optimizer.
> > > > >
> > > > > For those reasons, the current speedup is not dramatic. I see
> > > > > around a 10% improvement, far from how much we could reach.
> > > > >
> > > > > Further speedups will happen as the final passes are
> > > > > converted. The bigger issue is to write a JS parser in C++
> > > > > for this. This is not that easy as parsing JS is not that
> > > > > easy - there are some corner cases and ambiguities. I'm
> > > > > looking into existing code for this, but not sure there is
> > > > > anything we can easily use - JS engine parsers are in C++ but
> > > > > tend not to be easy to detach. If anyone has good ideas here
> > > > > that would be useful.
> > > > >
> > > > > - Alon
> > > > >
> > > >
> > > > --
> > > > You received this message because you are subscribed to the
> > > > Google Groups "emscripten-discuss" group.
> > > > To unsubscribe from this group and stop receiving emails from
> > > > it, send an email to
> > > > [email protected]. For more
> > > > options, visit https://groups.google.com/d/optout.
> > > >
> > >
> >
> >
>
//selfen lexen parsen
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//constant
const char crlf[]="\r\n";
const char quote[]="\"";
//class
template <typename T> class pointer;
//prototype
const char* null(const char* value);
void print(const char* value);
void print(pointer<char>& value);
void print(const char* name,int value);
void print(const char* name,const char* value);
void ln();
void error(const char* source,const char* what);
//clear
void clear(pointer<char>& result);
//allocate
template <typename T> bool allocate(pointer<T>& result,int count);
bool allocate(pointer<char>& result,int count);
//deallocate
void deallocate(FILE* value)
{
if(value==0)
{
error("deallocate.file","null");
return;
}
if(fclose(value)!=0)
error("deallocate","fclose");
}
template <typename T> void deallocate(T* value)
{
if(value==0)
{
error("deallocate","null");
return;
}
free(value);
}
//pointer
template <typename T> class pointer
{
public:
T* value;
int length;
pointer()
{
this->value=0;
this->length=0;
}
virtual ~pointer()
{
this->clear();
}
//clear
void clear()
{
if(this->value!=0)
deallocate(this->value);
this->value=0;
this->length=0;
}
//attach
void attach(T* value,int length)
{
if(length<0)
error("pointer.attach","length");
this->clear();
this->value=value;
this->length=length;
}
//zeroize
void zeroize()
{
if(this->empty())
return;
if(memset(this->value,0,this->size_of())!=this->value)
error("pointer.zeroize","memset");
}
//front
T& front()
{
if(this->empty())
error("pointer.front","empty");
return this->at(0);
}
//back
T& back()
{
if(this->empty())
error("pointer.back","empty");
return this->at(this->length-1);
}
//at
T& at(int index)
{
if(this->empty())
error("pointer.at","empty");
if(index<0||index>=this->length)
error("pointer.at","range");
return this->value[index];
}
//get
T* get()
{
if(this->empty())
error("pointer.get","empty");
return this->value;
}
//size of
int size_of()
{
return this->length*(int)sizeof(T);
}
//empty
bool empty()
{
return this->value==0;
}
};
//clear
void clear(pointer<char>& result)
{
if(result.empty())
return;
//terminal zero
result.front()=0;
}
//allocate
template <typename T> bool allocate(pointer<T>& result,int count)
{
if(count<=0)
{
error("allocate","count");
return false;
}
//malloc
result.attach((T*)malloc(count*sizeof(T)),count);
if(result.empty())
{
error("allocate","malloc");
return false;
}
//zeroize
result.zeroize();
return true;
}
bool allocate(pointer<char>& result,int count)
{
if(!allocate<char>(result,count))
{
error("allocate","char");
return false;
}
//clear
clear(result);
return true;
}
//length
int length(const char* value)
{
if(value==0)
{
error("length","null");
return 0;
}
int result=(int)strlen(value);
if(result<0)
{
error("length","strlen");
return 0;
}
return result;
}
int length(pointer<char>& value)
{
int result=length(value.get());
if(result>=value.length) //with terminal zero
{
error("length","overflow");
return 0;
}
return result;
}
//append
bool append(pointer<char>& left,const char* right)
{
int left_length=length(left);
int right_length=length(right);
if(left_length+right_length>=left.length) //with terminal zero
{
error("append","overflow");
return false;
}
if(strcat(left.get(),right)!=left.get())
{
error("append","strcat");
return false;
}
return true;
}
bool append(pointer<char>& left,pointer<char>& right)
{
return append(left,right.get());
}
bool append(pointer<char>& left,int right)
{
char buffer[200]="";
if(sprintf(buffer,"%d",right)<=0)
{
error("append","sprintf");
return false;
}
//report an overflow of the left buffer to the caller
return append(left,buffer);
}
//load
bool load(pointer<char>& result,const char* path);
//node
class node
{
public:
enum class type
{
space,
};
};
//lexer
class lexer
{
public:
int index;
pointer<char> text;
lexer()
{
this->index=0;
}
//clear
void clear()
{
this->index=0;
this->text.clear();
}
//load
bool load(const char* path)
{
this->clear();
if(!::load(this->text,path))
{
error("lexer.load",path);
return false;
}
return true;
}
//single
bool single(pointer<char>& result)
{
if(!this->more())
return false;
result.front()=this->get();
result.at(1)=0;
this->index++;
return true;
}
//spaces
bool spaces(pointer<char>& result)
{
return this->is(result,isspace);
}
//blanks
bool blanks(pointer<char>& result)
{
return this->is(result,isblank);
}
//alnum
bool alnum(pointer<char>& result)
{
return this->is(result,isalnum);
}
//is
bool is(pointer<char>& result,int(*selector)(int))
{
if(selector==0)
{
error("lexer.is","selector");
return false;
}
int count=0;
while(this->more())
{
char current=this->get();
unsigned char byte=current;
if(!selector(byte))
break;
result.at(count)=current;
result.at(count+1)=0;
count++;
this->index++;
}
return count>0;
}
//open bracket
bool open_bracket(pointer<char>& result)
{
::clear(result);
if(this->separator("("))
{
append(result,"(");
return true;
}
if(this->separator("{"))
{
append(result,"{");
return true;
}
if(this->separator("["))
{
append(result,"[");
return true;
}
return false;
}
//close bracket
bool close_bracket(pointer<char>& result)
{
::clear(result);
if(this->separator(")"))
{
append(result,")");
return true;
}
if(this->separator("}"))
{
append(result,"}");
return true;
}
if(this->separator("]"))
{
append(result,"]");
return true;
}
return false;
}
//separator
bool separator(const char* value)
{
if(value==0)
{
error("lexer.separator","null");
return false;
}
int cursor=this->index;
for(int i=0;value[i]!=0;i++)
{
if(!this->more()||this->get()!=value[i])
{
//rollback
this->index=cursor;
return false;
}
//forward
this->index++;
}
return true;
}
//new line
bool new_line()
{
return this->separator("\n");
}
//more
bool more()
{
//the last cell of the text is the terminal zero added by load
return this->index<this->text.length-1;
}
//get
char get()
{
return this->text.at(this->index);
}
};
//parse
bool parse(const char* path)
{
lexer lexer;
//load
if(!lexer.load(path))
{
error("parse.load",path);
return false;
}
//buffer
pointer<char> buffer;
pointer<char> message;
allocate(buffer,1024);
allocate(message,1024);
//loop
int count=0;
int line=0;
while(lexer.more())
{
const char* category="none";
int size=0;
if(lexer.new_line())
{
category="new-line";
//no text for this token (do not print the previous one)
clear(buffer);
line++;
}
else if(lexer.blanks(buffer))
{
category="blank";
size=length(buffer);
}
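else if(lexer.spaces(buffer))
{
//note: isspace also matches "\n", so a "\r\n" sequence is
//consumed here without incrementing the line counter
category="space";
size=length(buffer);
}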
else if(lexer.open_bracket(buffer))
{
category="open-bracket";
size=length(buffer);
}
else if(lexer.close_bracket(buffer))
{
category="close-bracket";
size=length(buffer);
}
else if(lexer.alnum(buffer))
{
category="alnum";
size=length(buffer);
}
else if(lexer.single(buffer))
{
category="single";
size=length(buffer);
}
else
{
error("parse","lexer");
return false;
}
clear(message);
append(message,"#");
append(message,count);
append(message," ");
append(message,"line=");
append(message,line);
append(message," ");
append(message,category);
append(message," ");
append(message,size);
append(message," ");
append(message,quote);
append(message,buffer);
append(message,quote);
print(message);
count++;
}
return true;
}
//main
int main(int count,char** arguments)
{
print("main.count",count);
for(int i=0;i<count;i++)
{
print("main.argument",arguments[i]);
}
//parse
ln();
int parsed=0;
for(int i=1;i<count;i++)
{
if(parse(arguments[i]))
parsed++;
else
error("parse",arguments[i]);
}
//pwd
ln();
if(parsed==0)
{
print("pwd");
system("pwd");
print("Hit a key ...");
system("read");
}
else
print("parsed",parsed);
}
//null
const char* null(const char* value)
{
if(value==0)
return "null";
return value;
}
//print
void print(const char* value)
{
if(printf("%s%s",null(value),crlf)<=0)
error("print","printf");
}
void print(pointer<char>& value)
{
print(value.get());
}
void print(const char* name,int value)
{
if(printf("%s=%d%s",null(name),value,crlf)<=0)
error("print","printf");
}
void print(const char* name,const char* value)
{
if(printf("%s=%s%s%s%s",null(name),quote,null(value),quote,crlf)<=0)
error("print","printf");
}
//ln
void ln()
{
print("");
}
//error
void error(const char* source,const char* what)
{
ln();
print("error.source",source);
print("error.what",what);
assert(false);
}
//load
bool load(pointer<char>& result,const char* path)
{
if(path==0)
{
error("load","path");
return false;
}
result.clear();
//open
pointer<FILE> file;
file.attach(fopen(path,"rb"),1);
if(file.empty())
{
error("fopen",path);
return false;
}
print("load.path",path);
//size
if(fseek(file.get(),0,SEEK_END)!=0)
{
error("fseek",path);
return false;
}
int size=(int)ftell(file.get());
if(size<=0)
{
error("ftell",path);
return false;
}
print("load.size",size);
if(fseek(file.get(),0,SEEK_SET)!=0)
{
error("fseek",path);
return false;
}
//allocate
if(!allocate(result,size+1))
{
error("load","allocate");
return false;
}
//read
if(fread(result.get(),size,1,file.get())!=1)
{
error("fread",path);
return false;
}
return true;
}