I started a JS parser today (named "parsen" :-) ).

I don't have a GitHub account yet, so I have attached the code
("parsen.main.cpp").

The objective is to make a program that can parse its own code (the JS
output from emscripten).

For the moment, I have a lexer. It splits the input into new lines ("\n"),
blanks and other whitespace, words ("alnum") and brackets (open/close;
round, curly, square ...).

You can look at "bool parse(const char* path)" in the source file.

I want to provide the JS AST as a flat indexed array (a parent/child
table).

Each cell should have:
 - its type (a friendly C++11 "enum class": "function", "var", "+" ...)
 - the index of its parent (in the array)
 - its line/column number in the source file
 - its name/content or value (floats ...)
 - ...
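
As a rough illustration of the table described above, here is a minimal sketch. The names "cell", "ast", "node_type", "add", "children" and "sample" are made up for this example and are not in the attached source; the node kinds listed are only a tiny, hypothetical subset.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node kinds; a real list would cover the whole JS grammar.
enum class node_type { program, function, var, number, plus };

// One cell of the flat table. Field names are illustrative only.
struct cell
{
    node_type type;
    int parent;        // index of the parent cell, -1 for the root
    int line;
    int column;
    std::string text;  // name, operator or literal text
};

// The whole AST is one vector of cells; children refer to parents by index.
struct ast
{
    std::vector<cell> cells;

    // Appends a cell and returns its index in the table.
    int add(node_type type, int parent, int line, int column,
            const std::string& text)
    {
        // A parent must already be in the table (or be -1 for the root).
        assert(parent >= -1 && parent < (int)cells.size());
        cells.push_back(cell{type, parent, line, column, text});
        return (int)cells.size() - 1;
    }

    // Collects the indices of the direct children of a cell.
    std::vector<int> children(int index) const
    {
        std::vector<int> result;
        for (int i = 0; i < (int)cells.size(); i++)
            if (cells[i].parent == index)
                result.push_back(i);
        return result;
    }
};

// Builds the table for: function f() { var x; }
ast sample()
{
    ast tree;
    int root = tree.add(node_type::program, -1, 1, 1, "");
    int f = tree.add(node_type::function, root, 1, 1, "f");
    tree.add(node_type::var, f, 1, 16, "x");
    return tree;
}
```

Storing only the parent index keeps each cell small and makes the table easy to serialize; the price is that finding the children of a cell is a linear scan, unless a separate child index is built.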

I can also decode the special instructions of asm.js ("|0" ...).
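
To give an idea of what decoding these annotations could look like, here is a deliberately crude sketch that works on the text of an expression. In asm.js, "x|0" coerces to a signed integer, "x>>>0" to an unsigned integer and "+x" to a double. The names "coercion", "ends_with" and "classify" are invented for this example; a real decoder would inspect the AST nodes, not strings.

```cpp
#include <cassert>
#include <string>

// The asm.js coercions this sketch knows about.
enum class coercion { none, signed_int, unsigned_int, double_ };

// Returns true when text ends with suffix.
static bool ends_with(const std::string& text, const std::string& suffix)
{
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Classifies an expression given as text, with blanks already removed.
coercion classify(const std::string& expression)
{
    // Assumption of this sketch: the caller stripped all blanks.
    assert(expression.find(' ') == std::string::npos);

    if (ends_with(expression, ">>>0"))
        return coercion::unsigned_int;

    if (ends_with(expression, "|0"))
        return coercion::signed_int;

    if (!expression.empty() && expression[0] == '+')
        return coercion::double_;

    return coercion::none;
}
```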

The end goal is to be able to run queries over the AST (like LINQ).

Something like:
 ast.select("function").select("var").dump();
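
A chained query of this kind could be sketched as follows, assuming the flat parent/child table. Everything here is hypothetical ("cell", "selection", "from", "names", "sample_table"): each "select" keeps the cells of the requested type that have an ancestor in the current selection, and "names" stands in for "dump" so that the result can be inspected.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal cell for this sketch; field names are illustrative.
struct cell
{
    std::string type;  // "program", "function", "var" ...
    int parent;        // index of the parent cell, -1 for the root
    std::string name;
};

// A selection is a list of cell indices into a table it does not own.
class selection
{
public:
    selection(const std::vector<cell>& table, const std::vector<int>& indices)
        : table_(table), indices_(indices)
    {
        for (int i : indices_)
            assert(i >= 0 && i < (int)table_.size());
    }

    // Selects every cell of the given type that has an ancestor
    // in the current selection.
    selection select(const std::string& type) const
    {
        std::vector<int> result;
        for (int i = 0; i < (int)table_.size(); i++)
            if (table_[i].type == type && below_selection(i))
                result.push_back(i);
        return selection(table_, result);
    }

    // Names of the selected cells, in table order.
    std::vector<std::string> names() const
    {
        std::vector<std::string> result;
        for (int i : indices_)
            result.push_back(table_[i].name);
        return result;
    }

private:
    // Walks the parent chain of a cell, looking for a selected ancestor.
    bool below_selection(int index) const
    {
        for (int p = table_[index].parent; p >= 0; p = table_[p].parent)
            for (int s : indices_)
                if (s == p)
                    return true;
        return false;
    }

    const std::vector<cell>& table_;
    std::vector<int> indices_;
};

// Starts a query at the root cells (those without a parent).
selection from(const std::vector<cell>& table)
{
    std::vector<int> roots;
    for (int i = 0; i < (int)table.size(); i++)
        if (table[i].parent < 0)
            roots.push_back(i);
    return selection(table, roots);
}

// Table for: var g; function f() { var x; } function h() { var y; }
std::vector<cell> sample_table()
{
    return {
        {"program", -1, ""},
        {"var", 0, "g"},
        {"function", 0, "f"},
        {"var", 2, "x"},
        {"function", 0, "h"},
        {"var", 4, "y"},
    };
}
```

With this table, from(table).select("function").select("var") keeps "x" and "y" but not the global "g", because "g" has no function among its ancestors. The table must outlive the selections, since they only hold a reference to it.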

Can you tell me more about your needs? What kinds of patterns does the
optimizer need to look for?

PS: Sorry for the many mistakes. I'm French and not at all good at
English!

On Sun, 16 Nov 2014 11:02:45 -0800,
Alon Zakai <[email protected]> wrote:

> The goal is to parse the JS output of the fastcomp LLVM backend. Then
> we run optimization passes on that AST.
> 
> Thanks about TinyJS, looks interesting! Ok, at this point I am
> considering 3 options:
> 
> 1. Modify TinyJS parser (already in C++, which is good)
> 2. Port Higgs parser from D (nicest written code of all the options)
> 3. Port Acorn parser from JS
> 
> I am leaning toward the last, because it seems the most active and
> maintained, and has support for parsing ES6 already (we don't need
> that immediately, but eventually we might). Also it is the only one
> that has focused on parsing speed, as far as I can tell.
> 
> - Alon
> 
> 
> 
> On Fri, Nov 14, 2014 at 7:44 PM, Marc <[email protected]> wrote:
> 
> > This one is not bad:
> >  https://code.google.com/p/tiny-js/source/browse/trunk/TinyJS.h
> >
> > There are only two files to include.
> >
> > The licence is OK (MIT-like).
> >
> > Which part of the JS files do you want to parse? Is it the generated
> > "LLVM as JS" output or any of the libraries you've made (like
> > "parseTools.js" or "analyzer.js")?
> >
> > I've looked a bit at ANTLR but the grammar files for JavaScript are
> > old.
> >
> > There is a more "exotic" alternative I can imagine. It is to use
> > this Haskell parser:
> >
> > https://hackage.haskell.org/package/language-javascript
> >
> > The grammar file is really pretty:
> >
> >
> > https://github.com/alanz/language-javascript/blob/master/src/Language/JavaScript/Parser/Grammar5.y
> >
> > I know that GHC generates a kind of C ("C--") as intermediate
> > code. It may be possible to wrap a function around it.
> >
> > It's a crazy idea :-)
> >
> >
> >
> > On Fri, 14 Nov 2014 16:43:55 -0800,
> > Alon Zakai <[email protected]> wrote:
> >
> > > I wasn't familiar with that, thanks. Looks interesting, however
> > > the GPL license is a problem as we do want the option to run the
> > > parser on the client machine, linked to other code, and this
> > > would limit the number of people that would use it.
> > >
> > > - Alon
> > >
> > >
> > > On Fri, Nov 14, 2014 at 3:04 AM, Marc <[email protected]> wrote:
> > >
> > > > Do you know this one?
> > > >  https://github.com/cesanta/v7
> > > >
> > > > On Thu, 13 Nov 2014 17:19:46 -0800,
> > > > Alon Zakai <[email protected]> wrote:
> > > >
> > > > > Early this year the fastcomp project replaced the core
> > > > > compiler, which was written in JS, with an LLVM backend in
> > > > > C++, and that brought large compilation speedups. However,
> > > > > the late JS optimization passes were still run in JS, which
> > > > > meant optimized builds could be slow (in unoptimized builds,
> > > > > we don't run those JS optimizations, typically). Especially
> > > > > in very large projects, this could be annoying.
> > > > >
> > > > > Progress towards speeding up those JS optimization passes just
> > > > > landed, turned off, on incoming. This is not yet stable or
> > > > > ready, so it is *not* enabled by default. Feel free to test
> > > > > it though and report bugs. To use it, build with
> > > > >
> > > > > EMCC_NATIVE_OPTIMIZER=1
> > > > >
> > > > > in the environment, e.g.
> > > > >
> > > > > EMCC_NATIVE_OPTIMIZER=1 emcc -O2 tests/hello_world.c
> > > > >
> > > > > It only matters when building to JS (not building C++ to
> > > > > object/bitcode). When EMCC_DEBUG=1 is used, you should see it
> > > > > mention it uses the native optimizer. The first time you use
> > > > > it, it will also say it is compiling it, which can take
> > > > > several seconds.
> > > > >
> > > > > The native optimizer is basically a port of the JS optimizer
> > > > > passes from JS into c++11. c++11 features like lambdas made
> > > > > this much easier than it would have been otherwise, as the JS
> > > > > code has lots of lambdas. The ported code uses the same
> > > > > JSON-based AST, implemented in C++.
> > > > >
> > > > > Using c++11 is a little risky. We build the code natively,
> > > > > using clang from fastcomp, but we do use the system C++
> > > > > standard libraries. In principle if those are not
> > > > > c++11-friendly, problems could happen. It seems to work fine
> > > > > where I tested so far.
> > > > >
> > > > > Not all passes have been converted, but the main
> > > > > time-consuming passes in -O2 have been (eliminator,
> > > > > simplifyExpressions, registerize). (Note that in -O3 the
> > > > > registerizeHarder pass has *not* yet been converted.) The
> > > > > toolchain can handle running some passes in JS and some
> > > > > passes natively, using JSON to serialize them.
> > > > >
> > > > > Potentially this approach can speed us up very significantly,
> > > > > but it isn't quite there yet. JSON parsing/unparsing and
> > > > > running the passes themselves can be done natively, and in
> > > > > tests I see that running 4x faster, and using about half as
> > > > > much memory. However, there is overhead from serializing JSON
> > > > > between native and JS, which will remain until 100% of the
> > > > > passes you use are native. Also, and more significantly, we
> > > > > do not have a parser from JS - the output of fastcomp - to
> > > > > the JSON AST. That means that we send fastcomp output into JS
> > > > > to be parsed, it emits JSON, and we read that in the native
> > > > > optimizer.
> > > > >
> > > > > For those reasons, the current speedup is not dramatic. I see
> > > > > around a 10% improvement, far from how much we could reach.
> > > > >
> > > > > Further speedups will happen as the final passes are
> > > > > converted. The bigger issue is to write a JS parser in C++
> > > > > for this. This is not easy, because parsing JS itself is not
> > > > > easy - there are some corner cases and ambiguities. I'm
> > > > > looking into existing code for this, but not sure there is
> > > > > anything we can easily use - JS engine parsers are in C++ but
> > > > > tend not to be easy to detach. If anyone has good ideas here
> > > > > that would be useful.
> > > > >
> > > > > - Alon
> > > > >
> > > >
> > > > --
> > > > You received this message because you are subscribed to the
> > > > Google Groups "emscripten-discuss" group.
> > > > To unsubscribe from this group and stop receiving emails from
> > > > it, send an email to
> > > > [email protected]. For more
> > > > options, visit https://groups.google.com/d/optout.
> > > >
> > >
> >
> >
> 

//selfen lexen parsen

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//constant

const char crlf[]="\r\n";
const char quote[]="\"";

//class

template <typename T> class pointer;

//prototype

const char* null(const char* value);

void print(const char* value);
void print(pointer<char>& value);
void print(const char* name,int value);
void print(const char* name,const char* value);
void ln();

void error(const char* source,const char* what);

//clear

void clear(pointer<char>& result);

//allocate

template <typename T> bool allocate(pointer<T>& result,int count);

bool allocate(pointer<char>& result,int count);

//deallocate

void deallocate(FILE* value)
{
 if(value==0)
 {
  error("deallocate.file","null");
  
  return;
 }
 
 if(fclose(value)!=0)
  error("deallocate","fclose"); 
}

template <typename T> void deallocate(T* value)
{
 if(value==0)
 {
  error("deallocate","null");
  
  return;
 }
 
 free(value);
}

//pointer

template <typename T> class pointer
{
 public:
 
 T* value;
 int length;
  
 pointer()
 {
  this->value=0;
  this->length=0;  
 }
 
 virtual ~pointer()
 {
  this->clear();
 }
 
 //clear
 
 void clear()
 {
  if(this->value!=0)
   deallocate(this->value);
   
  this->value=0;
  this->length=0;
 }
 
 //attach
 
 void attach(T* value,int length)
 {
  if(length<0)
   error("pointer.attach","length");
   
  this->clear();
  
  this->value=value;
  this->length=length;
 }
 
 //zeroize
 
 void zeroize()
 {
  if(this->empty())
   return;
    
  if(memset(this->value,0,this->size_of())!=this->value)
   error("pointer.zeroize","memset");
 } 
 
 //front
 
 T& front()
 {
  if(this->empty())
   error("pointer.front","empty");
   
  return this->at(0);
 }

 //back
 
 T& back()
 {
  if(this->empty())
   error("pointer.back","empty");
   
  return this->at(this->length-1);
 }
  
 //at
 
 T& at(int index)
 {
  if(this->empty())
   error("pointer.at","empty");
   
  if(index<0||index>=this->length)
   error("pointer.at","range");
  
  return this->value[index];
 }
 
 //get
 
 T* get()
 {
  if(this->empty())
   error("pointer.get","empty");
   
  return this->value;
 }
 
 //size of
 
 int size_of()
 {
  return this->length*(int)sizeof(T);
 }
 
 //empty
 
 bool empty()
 {
  return this->value==0;
 }
};

//clear

void clear(pointer<char>& result)
{
 if(result.empty())
  return;
  
 //terminal zero
 
 result.front()=0;
}

//allocate

template <typename T> bool allocate(pointer<T>& result,int count)
{ 
 if(count<=0)
 {
  error("allocate","count");
  
  return false;
 }
 
 //malloc
 
 result.attach((T*)malloc(count*sizeof(T)),count);
 
 if(result.empty())
 {
  error("allocate","malloc");
  
  return false;
 }
 
 //zeroize
 
 result.zeroize();
 
 return true;
}

bool allocate(pointer<char>& result,int count)
{
 if(!allocate<char>(result,count))
 {
  error("allocate","char");
  
  return false;
 }
 
 //clear
 
 clear(result);
  
 return true;
}

//length

int length(const char* value)
{
 if(value==0)
 {
  error("length","null");
  
  return 0;
 }
 
 int result=(int)strlen(value);
 
 if(result<0)
 {
  error("length","strlen");
  
  return 0;
 }
  
 return result;
}

int length(pointer<char>& value)
{
 int result=length(value.get());
 
 if(result>=value.length) //with terminal zero
 {
  error("length","overflow");
  
  return 0;
 }
 
 return result;
}

//append

bool append(pointer<char>& left,const char* right)
{
 int left_length=length(left);
 int right_length=length(right);
 
 if(left_length+right_length>=left.length) //with terminal zero
 {
  error("append","overflow");
  
  return false;
 }
 
 if(strcat(left.get(),right)!=left.get())
 {
  error("append","strcat");
  
  return false;
 }
 
 return true;
}

bool append(pointer<char>& left,pointer<char>& right)
{
 return append(left,right.get());
}

bool append(pointer<char>& left,int right)
{
 char buffer[200]="";
 
 if(sprintf(buffer,"%d",right)<=0)
 {
  error("append","sprintf");
  
  return false;
 }
 
 append(left,buffer);
 
 return true;
}

//load

bool load(pointer<char>& result,const char* path);

//node

class node
{
 public:
 
 enum class type
 {
  space,
  
 };
};

//lexer

class lexer
{
 public:
 
 int index;

 pointer<char> text;
 
 lexer()
 {
  this->index=0;
 }
 
 //clear
 
 void clear()
 {
  this->index=0;
  
  this->text.clear();
 }
 
 //load

 bool load(const char* path)
 {
  this->clear();
    
  if(!::load(this->text,path))
  {
   error("lexer.load",path);
   
   return false;
  }
  
  return true;
 }
 
 //single
 
 bool single(pointer<char>& result)
 {
  if(!this->more())
   return false;
  
  result.front()=this->get();
  result.at(1)=0;
  
  this->index++;
  
  return true;
 }
 
 //spaces

 bool spaces(pointer<char>& result)
 {
  return this->is(result,isspace);
 }

 //blanks

 bool blanks(pointer<char>& result)
 {
  return this->is(result,isblank);
 }

 //alnum

 bool alnum(pointer<char>& result)
 {
  return this->is(result,isalnum);
 }
 
 //is

 bool is(pointer<char>& result,int(*selector)(int))
 {
  if(selector==0)
  {
   error("lexer.is","selector");
   
   return false;
  }
  
  int count=0;
  
  while(this->more())
  {
   char current=this->get();
   unsigned char byte=current;
   
   if(!selector(byte))
    break;
  
   result.at(count)=current;
   result.at(count+1)=0;

   count++;
   
   this->index++;
  }
  
  return count>0;
 }

 //open bracket
 
 bool open_bracket(pointer<char>& result)
 {
  ::clear(result);
  
  if(this->separator("("))
  {
   append(result,"(");
   
   return true;
  }
  
  if(this->separator("{"))
  {
   append(result,"{");
   
   return true;
  }
  
  if(this->separator("["))
  {
   append(result,"[");
   
   return true;
  }
  
  return false;
 }

 //close bracket
 
 bool close_bracket(pointer<char>& result)
 {
  ::clear(result);
  
  if(this->separator(")"))
  {
   append(result,")");
   
   return true;
  }
  
  if(this->separator("}"))
  {
   append(result,"}");
   
   return true;
  }
  
  if(this->separator("]"))
  {
   append(result,"]");
   
   return true;
  }
  
  return false;
 }
 
 //separator

 bool separator(const char* value)
 {
  if(value==0)
  {
   error("lexer.separator","null");
   
   return false;
  }

  int cursor=this->index;
  
  for(int i=0;value[i]!=0;i++)
  {  
   if(!this->more()||this->get()!=value[i])
   {
    //rollback
    
    this->index=cursor;
    
    return false;
   }
   
   //forward
   
   this->index++;
  }
    
  return true;
 }
 
 //new line

 bool new_line()
 {
  return this->separator("\n");
 }
 
 //more
 
 bool more()
 {
  //stop before the terminal zero added by load
  
  return this->index<this->text.length-1;
 }
 
 //get
 
 char get()
 {
  return this->text.at(this->index);
 }
};

//parse

bool parse(const char* path)
{
 lexer lexer;
 
 //load
 
 if(!lexer.load(path))
 { 
  error("parse.load",path);
  
  return false;
 }
 
 //buffer
 
 pointer<char> buffer;
 pointer<char> message;
 
 allocate(buffer,1024);
 allocate(message,1024);
  
 //loop
 
 int count=0;
 int line=0; 
 
 while(lexer.more())
 {
  const char* category="none";
  int size=0;
  
  if(lexer.new_line())
  {
   category="new-line";
   
   //drop the previous token from the buffer
   
   clear(buffer);
   
   line++;
  }
  else if(lexer.blanks(buffer))
  {
   category="blank";
      
   size=length(buffer);
  }
  else if(lexer.spaces(buffer))
  {
   category="space";
      
   size=length(buffer);
  }
  else if(lexer.open_bracket(buffer))
  {
   category="open-bracket";
      
   size=length(buffer);
  }
  else if(lexer.close_bracket(buffer))
  {
   category="close-bracket";
      
   size=length(buffer);
  }
  else if(lexer.alnum(buffer))
  {
   category="alnum";
      
   size=length(buffer);
  }
  else if(lexer.single(buffer))
  {
   category="single";
      
   size=length(buffer);
  }
  else
  {
   error("parse","lexer");
   
   return false;
  }
  
  clear(message);
  
  append(message,"#");
  append(message,count);
  append(message," ");
  append(message,"line=");
  append(message,line);
  append(message," ");
  append(message,category);
  append(message," ");
  append(message,size);
  append(message," ");
  append(message,quote);
  append(message,buffer);
  append(message,quote);

  print(message);
  
  count++;  
 }
 
 return true;
}

//main

int main(int count,char** arguments)
{
 print("main.count",count);
 
 for(int i=0;i<count;i++)
 {
  print("main.argument",arguments[i]);
 }
 
 //parse
 
 ln();
 
 int parsed=0;
  
 for(int i=1;i<count;i++)
 {
  if(parse(arguments[i]))
   parsed++;
  else
   error("parse",arguments[i]);      
 }
   
 //pwd
 
 ln();
 
 if(parsed==0)
 {
  print("pwd");
  system("pwd");
  
  print("Hit a key ...");
  system("read");
 }
 else
  print("parsed",parsed);
}

//null

const char* null(const char* value)
{
 if(value==0)
  return "null";
  
 return value;
}

//print

void print(const char* value)
{
 if(printf("%s%s",null(value),crlf)<=0)
  error("print","printf");
}

void print(pointer<char>& value)
{
 print(value.get());
}

void print(const char* name,int value)
{
 if(printf("%s=%d%s",null(name),value,crlf)<=0)
  error("print","printf");
}

void print(const char* name,const char* value)
{
 if(printf("%s=%s%s%s%s",null(name),quote,null(value),quote,crlf)<=0)
  error("print","printf");
}

//ln

void ln()
{
 print("");
}

//error

void error(const char* source,const char* what)
{
 ln();
 
 print("error.source",source);
 print("error.what",what);
 
 assert(false);
}

//load

bool load(pointer<char>& result,const char* path)
{
 if(path==0)
 {
  error("load","path");
  
  return false;
 }
 
 result.clear();
  
 //open
  
 pointer<FILE> file;
 
 file.attach(fopen(path,"rb"),1);
 
 if(file.empty())
 {
  error("fopen",path);
  
  return false;
 }

 print("load.path",path);
 
 //size
 
 if(fseek(file.get(),0,SEEK_END)!=0)
 {
  error("fseek",path);
  
  return false;
 }
  
 int size=(int)ftell(file.get());
  
 if(size<=0)
 {
  error("ftell",path);
  
  return false;
 }
  
 print("load.size",size);

 if(fseek(file.get(),0,SEEK_SET)!=0)
 {
  error("fseek",path);
  
  return false;
 }
  
 //allocate
    
 if(!allocate(result,size+1))
 {
  error("load","allocate");
  
  return false;
 }
 
 //read
    
 if(fread(result.get(),size,1,file.get())!=1)
 {
  error("fread",path);
  
  return false;
 }  
 
 return true;
}
