RE: splitting a search string into tokens

2004-04-05 Thread Robert Taylor
Daniel, 

> Lexical analysis (i.e., tokenization) and parsing are two separate activities.
I know little about either.

> Sometimes you can get away with combining the two, but you'll find you can
> only do so much with split.
I started coming to that conclusion as well.

> Define a regular expression for each of
> your tokens and consume the input matching against each in a specified
> order.
My knowledge of regular expressions is slim to none, I hadn't heard back from
anyone on this list, and my Googling turned up only near-solutions, so I was
going to brute-force it: extract all the quoted strings, then split whatever
was left on whitespace.


> [...] You have to test for the last
> token first to avoid misidentifying whitespace.  It so happens that you can
> manage this with split.  You appear to have almost gotten there already with:
>
> > Here is what I've tried (but it doesn't cover escaping metacharacters
> > which might be in the search string):
> >
> > /"(.*?)"|(\w+)/
I didn't come up with this myself; I found it on the web, posted by someone
who was trying to solve the same problem. Call it lazy, but if you've been in
the Windows world and aren't often exposed to regular expressions, they (and
the rules involved) are quite intimidating. My problem (parsing a search
string into tokens) seemed so common that I expected a ready-made answer.
Someone mentioned trying String.split(), which uses regular expressions, and
that is how I ended up here.



Thanks for the advice and help.

robert
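
[Editor's sketch] The brute-force plan described above (pull out the quoted strings first, then split what remains on whitespace) could look roughly like the following. It uses java.util.regex rather than ORO, and the class and method names are hypothetical, chosen only for illustration. It also assumes phrases are delimited by plain double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of ORO and not code from this thread.
class TwoPassTokenizer {
    // A double-quoted phrase; group 1 is the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = QUOTED.matcher(query);
        StringBuffer rest = new StringBuffer();
        // Pass 1: collect each quoted phrase and blank it out of the input.
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                tokens.add(m.group(1));
            }
            m.appendReplacement(rest, " ");
        }
        m.appendTail(rest);
        // Pass 2: split whatever is left on runs of whitespace.
        for (String word : rest.toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("apple \"red delicious\" pie"));
    }
}
```

One consequence of doing this in two passes: the phrases come out ahead of the bare words, so the original token order is lost. For the input above the result is "red delicious", "apple", "pie". If order matters, a single pass over the input is the better fit.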

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: splitting a search string into tokens

2004-04-04 Thread Daniel F. Savarese

In message [EMAIL PROTECTED], Robert Taylor
 writes:
> I need to parse the search string into tokens in the manner that search
> engines would.

Lexical analysis (i.e., tokenization) and parsing are two separate activities.
Sometimes you can get away with combining the two, but you'll find you can
only do so much with split.  Define a regular expression for each of
your tokens and consume the input matching against each in a specified
order.  In your case, the tokens appear to be \s+ (i.e., the separator,
which would be discarded), \S+, and "[^"]+?".  You have to test for the last
token first to avoid misidentifying whitespace.  It so happens that you can
manage this with split.  You appear to have almost gotten there already with:

> Here is what I've tried (but it doesn't cover escaping metacharacters
> which might be in the search string):
>
> /"(.*?)"|(\w+)/

I don't understand where your search string and escaped metacharacters
enter the picture.  If you need to escape metacharacters in a string,
use Perl5Compiler.quotemeta.  I hope that helps.
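
[Editor's sketch] For readers on the JDK's own regex package rather than ORO, the closest standard-library counterpart to escaping metacharacters is java.util.regex.Pattern.quote. The snippet below is only an illustration of that idea. It assumes the goal is to match a user-supplied term literally, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; shows literal matching of a user-supplied term.
class LiteralTermDemo {
    // True if text contains term exactly as typed, metacharacters included.
    static boolean containsLiteral(String text, String term) {
        // Pattern.quote wraps the term so that characters such as
        // '$', '.', and '+' lose their special meaning.
        return Pattern.compile(Pattern.quote(term)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("price: $5.00 (approx)", "$5.00"));
        System.out.println(containsLiteral("aab", "a+b"));
    }
}
```

Without the quoting step, a term like a+b would be read as "one or more a, then b" and would match aab, which is usually not what someone typing a search term intends.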

daniel
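
[Editor's sketch] The advice above (one regular expression per token type, tried in a fixed order at the current input position) can be illustrated as follows. This is not code from the thread, and it uses java.util.regex instead of ORO. The class name is hypothetical, and the quoted-phrase pattern assumes phrases are delimited by plain double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; one pattern per token type, tried in a fixed order.
class OrderedTokenizer {
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+?)\"");
    private static final Pattern SPACE = Pattern.compile("\\s+");
    private static final Pattern WORD = Pattern.compile("\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher quoted = QUOTED.matcher(input);
        Matcher space = SPACE.matcher(input);
        Matcher word = WORD.matcher(input);
        int pos = 0;
        int end = input.length();
        while (pos < end) {
            // Quoted phrase first, so its inner whitespace and its quote
            // characters are not claimed by the other two patterns.
            if (quoted.region(pos, end).lookingAt()) {
                tokens.add(quoted.group(1));
                pos = quoted.end();
            } else if (space.region(pos, end).lookingAt()) {
                pos = space.end(); // separator: consume and discard
            } else if (word.region(pos, end).lookingAt()) {
                tokens.add(word.group());
                pos = word.end();
            } else {
                // Not expected: every character is either \s or \S.
                throw new IllegalStateException("no token at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("apple \"red delicious\" pie"));
    }
}
```

Because the input is consumed left to right, the tokens keep their original order: "apple", "red delicious", "pie". An unterminated quote simply falls through to the bare-word pattern, so a stray " stays attached to its word. Whether that is the right behaviour depends on how forgiving the search box should be.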


