[il-antlr-interest: 32356] Re: [antlr-interest] Rematching AST Nodes

2011-05-02 Thread Courtney Falk

 Your grammar doesn't have an 'aaa' token.  It does have CHARACTERS
 tokens.  If 'aaa' is special, then you need to match it in your grammar
 like a keyword.  Then you can reference it in your tree grammar.
 Otherwise you will need to match any CHARACTERS token in your rematch
 rule and do what you need to when the value is 'aaa' and do something
 else when it is not.

 Your tree grammars can only work with the tokens your lexers produce
 (and the same set that your parsers use as well).

That's unfortunate.  I'm working on a workaround using semantic 
predicates.  The big downside is that the logic ends up split across two 
separate pieces of Java code: one implements the boolean validation 
function for the semantic predicate, and the other implements the string 
parsing function.  This solution is far less elegant than implementing 
everything as ANTLR logic.
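
One possible way to soften that duplication is to let a single lookup back both the predicate test and the conversion. The following is only a sketch under assumptions: the class name NumberWords, its two methods, and the small word table are hypothetical and do not come from the posted project.

```java
import java.util.Locale;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper: one table serves as both the validation and the parsing step.
final class NumberWords {
    // Small illustrative table; a real project would list every word it needs.
    private static final Map<String, Long> VALUES =
        Map.of("zero", 0L, "one", 1L, "two", 2L, "three", 3L);

    // Returns the numeric value of a known number word, or empty if the text is unknown.
    static OptionalLong parse(String text) {
        Long value = VALUES.get(text.toLowerCase(Locale.ROOT));
        return value == null ? OptionalLong.empty() : OptionalLong.of(value);
    }

    // Boolean form of the same lookup, suitable for use as a validation check.
    static boolean matches(String text) {
        return parse(text).isPresent();
    }
}
```

Under this assumption, a semantic predicate would call NumberWords.matches(...) on the token text and the rule action would call NumberWords.parse(...), so the word list lives in one place.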


Court

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address

-- 
You received this message because you are subscribed to the Google Groups 
il-antlr-interest group.
To post to this group, send email to il-antlr-inter...@googlegroups.com.
To unsubscribe from this group, send email to 
il-antlr-interest+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/il-antlr-interest?hl=en.



[il-antlr-interest: 32357] Re: [antlr-interest] Lexer too quick to grab a token?

2011-05-02 Thread Bart Kiers
On Mon, May 2, 2011 at 1:19 AM, Todd O'Bryan <toddobr...@gmail.com> wrote:

 ...


 Does this make any sense? Is there some way to deal with it?
  ...


You could let '/]]' be matched in the 'R_TAG' rule and emit another token as
per the instructions described here:
http://www.antlr.org/wiki/pages/viewpage.action?pageId=3604497

A demo:

lexer grammar TLexer;

@members {

  // Queue of tokens waiting to be handed to the parser.
  List<Token> tokens = new ArrayList<Token>();

  // Convenience overload: build a token from text and type, then queue it.
  private void emit(String text, int type) {
    Token token = new CommonToken(type, text);
    token.setType(type);
    emit(token);
  }

  @Override
  public void emit(Token token) {
    state.token = token;
    tokens.add(token);
  }

  @Override
  public Token nextToken() {
    super.nextToken();
    if(tokens.size() == 0) {
      return Token.EOF_TOKEN;
    }
    return (Token)tokens.remove(0);
  }
}

L_TAG
  :  '[/'
  ;

R_TAG
  :  '/]]' {emit("/", ANY); emit("]]", R_BRACKET);}
  |  '/]'
  ;

L_BRACKET
  :  '[['
  ;

R_BRACKET
  :  ']]'
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

ANY
  :  .
  ;

which can be tested with the class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "[/ foo /] [[/ bar /]]";
    ANTLRStringStream in = new ANTLRStringStream(source);
    TLexer lexer = new TLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    for(Object o : tokens.getTokens()) {
      Token t = (Token)o;
      System.out.println("text=" + t.getText() + ", type=" + t.getType());
    }
  }
}
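
For readers without ANTLR at hand, the queueing idea behind the demo can also be seen in isolation. The class below is a hand-written stand-in, not ANTLR code: the name QueueLexerSketch and the "TYPE:text" string format are assumptions made for illustration, and only the token names are borrowed from the grammar above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-alone illustration (no ANTLR dependency) of queuing several tokens per match.
final class QueueLexerSketch {
    private final String src;
    private int pos = 0;
    // Tokens produced by a match but not yet handed out.
    private final Deque<String> pending = new ArrayDeque<>();

    QueueLexerSketch(String src) { this.src = src; }

    // Returns the next "TYPE:text" token, or null at end of input.
    String nextToken() {
        while (pending.isEmpty() && pos < src.length()) {
            scanOne();
        }
        return pending.poll();
    }

    // Matches one lexeme at the current position; may queue zero, one or two tokens.
    private void scanOne() {
        if (src.startsWith("/]]", pos)) {
            // One match, two tokens: the '/' becomes ANY and ']]' becomes R_BRACKET.
            pending.add("ANY:/");
            pending.add("R_BRACKET:]]");
            pos += 3;
        } else if (src.startsWith("/]", pos)) {
            pending.add("R_TAG:/]");
            pos += 2;
        } else if (src.startsWith("[/", pos)) {
            pending.add("L_TAG:[/");
            pos += 2;
        } else if (src.startsWith("[[", pos)) {
            pending.add("L_BRACKET:[[");
            pos += 2;
        } else if (src.startsWith("]]", pos)) {
            pending.add("R_BRACKET:]]");
            pos += 2;
        } else if (Character.isWhitespace(src.charAt(pos))) {
            pos++; // whitespace is skipped, so nothing is queued
        } else {
            pending.add("ANY:" + src.charAt(pos));
            pos++;
        }
    }

    // Collects every token of the input into a list.
    static List<String> tokenize(String text) {
        QueueLexerSketch lexer = new QueueLexerSketch(text);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

Calling QueueLexerSketch.tokenize on the same test string as above should end with an ANY token for '/' followed by an R_BRACKET token for ']]', which is the behaviour the R_TAG action is meant to produce.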


Regards,

Bart.




[il-antlr-interest: 32358] Re: [antlr-interest] Rematching AST Nodes

2011-05-02 Thread Jim Idle
I suspect that you are approaching this problem incorrectly in some way.
Why do you feel you need to specify a new token at the AST stage? Why
don't you restate your goal, ignoring what you have done so far - I
suspect that we may be trying to solve a problem that you should not have.

Jim

 -Original Message-
 From: antlr-interest-boun...@antlr.org [mailto:antlr-interest-
 boun...@antlr.org] On Behalf Of Courtney Falk
 Sent: Monday, May 02, 2011 5:29 AM
 To: antlr-interest@antlr.org
 Subject: Re: [antlr-interest] Rematching AST Nodes


  Your grammar doesn't have an 'aaa' token.  It does have CHARACTERS
  tokens.  If 'aaa' is special, then you need to match it in your
  grammar like a keyword.  Then you can reference it in your tree
 grammar.
  Otherwise you will need to match any CHARACTERS token in your rematch
  rule and do what you need to when the value is 'aaa' and do something
  else when it is not.
 
  Your tree grammars can only work with the tokens your lexers produce
  (and the same set that your parsers use as well).

 That's unfortunate.  I'm working on a workaround using semantic
 predicates.  The huge downside is that I have to implement in a
 separate piece of Java code the boolean validation function for the
 semantic predicate.  Then in a second separate piece of Java code I
 implement the string parsing function.  This solution is far less
 elegant than implementing everything as ANTLR logic.


 Court




[il-antlr-interest: 32359] Re: [antlr-interest] Rematching AST Nodes

2011-05-02 Thread Courtney Falk

 On 5/2/2011 9:47 AM, Jim Idle wrote:

I suspect that you are approaching this problem incorrectly in some way.
Why do you feel you need to specify a new token at the AST stage? Why
don't you restate your goal, ignoring what you have done so far - I
suspect that we may be trying to solve a problem that you should not have.


Certainly.  I was trying to keep things simple/short, but I can expand.

My project is an NLP tokenizer/parser.  The first stage of functionality 
is implemented in the FuzzyLexer and FuzzyParser grammars.  They strip out 
all punctuation and white space, preserving them as tokens and grouping 
all the text between the punctuation/white space as unspecified tokens.
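
As a rough picture of what stage 1 is described as doing, here is a plain-Java sketch that does not use ANTLR at all. The class name FuzzySplitSketch, the punctuation set, and the "TYPE:text" output format are assumptions for illustration; the real grammars distinguish many more token types.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of stage 1: delimiters become their own tokens,
// and everything between them is grouped into one UNSPECIFIED token.
final class FuzzySplitSketch {
    // Hypothetical subset of the punctuation handled by the lexer grammar.
    private static final String PUNCTUATION = ".,;:?!-/()[]{}\"'";

    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean whitespace = Character.isWhitespace(c);
            if (whitespace || PUNCTUATION.indexOf(c) >= 0) {
                flush(run, out); // close the current run of ordinary characters
                out.add(whitespace ? "WHITESPACE" : "PUNCT:" + c);
            } else {
                run.append(c);
            }
        }
        flush(run, out);
        return out;
    }

    // Emits the accumulated run as one UNSPECIFIED token, if there is one.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add("UNSPECIFIED:" + run);
            run.setLength(0);
        }
    }
}
```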


Stage 1.5 is the language-specific composite grammar (Sentential.g), 
which imports the Fuzzy* grammars.  Here, I implement regular 
expressions used in semantic predicates that attempt to sort 
unspecified tokens into relevant categories (see also 
LongNumber.java).  For instance, the string "one" would be cast as a 
long-form number token.  Any unspecified tokens that don't match any 
semantic predicate stay unspecified tokens.
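
The categorization step might look roughly like the following in plain Java. This is a sketch, not the code from Sentential.g: the class TokenCategorizerSketch, the Category enum, and the DIGIT_NUMBER category are hypothetical, and the word list is deliberately short.

```java
import java.util.regex.Pattern;

// Regex-based classification of an unspecified token into a category.
final class TokenCategorizerSketch {
    // Hypothetical categories; DIGIT_NUMBER is added only to show a second case.
    enum Category { LONG_FORM_NUMBER, DIGIT_NUMBER, UNSPECIFIED }

    private static final Pattern LONG_FORM = Pattern.compile(
        "zero|one|two|three|four|five|six|seven|eight|nine|ten",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches() requires the whole token to match, so "ones" stays UNSPECIFIED.
    static Category categorize(String text) {
        if (LONG_FORM.matcher(text).matches()) {
            return Category.LONG_FORM_NUMBER;
        }
        if (DIGITS.matcher(text).matches()) {
            return Category.DIGIT_NUMBER;
        }
        return Category.UNSPECIFIED;
    }
}
```

Each pattern check corresponds to one semantic predicate in the description above; a token that fails every check keeps its unspecified type.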


Stage 2, which is yet to be written, walks the AST output by stage 1.5 
and wraps the tokens up into an application-specific data structure.  
This tree grammar will also perform tasks such as clustering together 
numbers into one single number, etc.
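
Since stage 2 is not written yet, the following is only a guess at what clustering number words into a single number could involve. The class NumberClusterSketch, its tables, and the accumulation rule are assumptions and are not taken from the project.

```java
import java.util.List;
import java.util.Map;

// Combines a sequence of number words such as "two hundred forty five" into one value.
final class NumberClusterSketch {
    // Illustrative subsets; a full table would cover every supported word.
    private static final Map<String, Long> SMALL = Map.of(
        "one", 1L, "two", 2L, "three", 3L, "five", 5L, "twenty", 20L, "forty", 40L);
    private static final Map<String, Long> SCALE = Map.of(
        "thousand", 1_000L, "million", 1_000_000L);

    static long combine(List<String> words) {
        long total = 0;   // value of the groups already closed by a scale word
        long current = 0; // value of the group being built (always below 1000)
        for (String word : words) {
            if (SMALL.containsKey(word)) {
                current += SMALL.get(word);
            } else if (word.equals("hundred")) {
                // A bare "hundred" counts as "one hundred".
                current = Math.max(current, 1) * 100;
            } else if (SCALE.containsKey(word)) {
                total += Math.max(current, 1) * SCALE.get(word);
                current = 0;
            } else {
                throw new IllegalArgumentException("not a number word: " + word);
            }
        }
        return total + current;
    }
}
```

With these tables, combine(List.of("two", "hundred", "forty", "five")) evaluates to 245 and combine(List.of("three", "thousand", "two", "hundred")) to 3200.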



Courtney Falk
co...@infiauto.com

lexer grammar FuzzyLexer;

options {
    filter=UNSPECIFIED;
    k=2;
}

@members {
    // Accumulates characters that have not (yet) been matched as a known token.
    private StringBuilder unknown;

    {
        unknown = new StringBuilder();
    }

    public void appendUnknown(char c) {
        unknown.append(c);
    }

    // Returns the accumulated text and resets the buffer.
    public String getUnknown() {
        String result = unknown.toString();
        clearUnknown();
        return result;
    }

    public void clearUnknown() {
        unknown.delete(0, unknown.length());
    }

    public boolean isUnknownEmpty() {
        return unknown.length() == 0;
    }

    @Override
    public void match(String s)
        throws MismatchedTokenException {

        int i = 0;
        while ( i<s.length() ) {
            unknown.append((char)input.LA(1));

            if ( input.LA(1)!=s.charAt(i) ) {
                if ( state.backtracking>0 ) {
                    state.failed = true;
                    return;
                }

                MismatchedTokenException mte =
                    new MismatchedTokenException(s.charAt(i), input);
                recover(mte);
                throw mte;
            }

            i++;
            input.consume();
            state.failed = false;
        }

        // successfully matched the string
        clearUnknown();
    }
}

ELLIPSIS : '...';

PERIOD : '.';

QUESTION_MARK : '?';

LEFT_QUESTION_MARK : '¿';

EXCLAMATION_POINT : '!';

LEFT_EXCLAMATION_POINT : '¡';

COMMA : ',';

COLON : ':';

SEMI_COLON : ';';

MDASH : '--';

DASH : '-';

FORWARD_SLASH : '/';

QUOTATION_MARK : '"';

SINGLE_QUOTATION_MARK : '\'';

LEFT_PARENTHESIS : '(';

RIGHT_PARENTHESIS : ')';

LEFT_BRACKET : '[';

RIGHT_BRACKET : ']';

LEFT_BRACE : '{';

RIGHT_BRACE : '}';

WHITESPACE : ' ' | '\t' | '\r' | '\n';

protected
UNSPECIFIED : . { unknown.append(getText()); };

parser grammar FuzzyParser;

@members {
    public Sentential_FuzzyLexer lexer;

    public void setLexer(Sentential_FuzzyLexer lexer) { this.lexer = lexer; }
}

whitespace : WHITESPACE+;

unspecified returns [String s]
    : UNSPECIFIED+
      {
          $s = lexer.getUnknown();
      }
    ;

nonterminal_punctuation
    : COMMA
    | COLON
    | SEMI_COLON
    | FORWARD_SLASH
    | MDASH
    | DASH
    | QUOTATION_MARK
    | SINGLE_QUOTATION_MARK
    ;

terminal_punctuation
    : PERIOD
    | EXCLAMATION_POINT
    | QUESTION_MARK
    | ELLIPSIS
    ;

package com.infiauto.ontosem.lang.eng;

enum LongNumber {
    ZERO("zero", 0, 0),
    ONE("one", 0, 1),
    TWO("two", 0, 2),
    THREE("three", 0, 3),
    FOUR("four", 0, 4),
    FIVE("five", 0, 5),
    SIX("six", 0, 6),
    SEVEN("seven", 0, 7),
    EIGHT("eight", 0, 8),
    NINE("nine", 0, 9),
    TEN("ten", 1, 10),
    ELEVEN("eleven", 1, 11),
    TWELVE("twelve", 1, 12),
    THIRTEEN("thirteen", 1, 13),
    FOURTEEN("fourteen", 1, 14),
    FIFTEEN("fifteen", 1, 15),
    SIXTEEN("sixteen", 1, 16),
    SEVENTEEN("seventeen", 1, 17),
    EIGHTEEN("eighteen", 1, 18),
    NINTEEN("nineteen", 1, 19),
    TWENTY("twenty", 1, 20),
    THIRTY("thirty", 1, 30),
    FORTY("forty", 1, 40),
    FIFTY("fifty", 1, 50),
    SIXTY("sixty", 1, 60),
    SEVENTY("seventy", 1, 70),
    EIGHTY("eighty", 1, 80),
    NINTY("ninety", 1, 90),
    HUNDRED("hundred", 2, 100),
    THOUSAND("thousand", 3, 1000),
    MILLION("million", 6, 1000000),
    BILLION("billion", 9, 1000000000);

    private String long_form;
    private long power;
    private long value;

    private LongNumber(String long_form, long power, long value) {
        this.long_form = long_form;
        this.power = power;
        this.value = value;
    }

    public String getLongForm() { return long_form; }
public long