Daniel,
Lexical analysis (i.e., tokenization) and parsing are two separate activities.
I know little about either.
Sometimes you can get away with combining the two, but you'll find you can
only do so much with split.
I started coming to that conclusion as well.
Define a regular expression for each of
your tokens and consume the input matching against each in a specified
order.
Since my knowledge of regexps is slim to nil at best, I hadn't heard back from
anyone on this list, and my Googling produced only near solutions, I was going
to brute force it: extract all quoted strings and then split whatever was left
on whitespace.
You have to test for the last
token first to avoid misidentifying whitespace. It so happens that you can
manage this with split. You appear to have almost gotten there already with:
Here is what I've tried (but it doesn't cover escaping metacharacters
which might be in the search string):
/"(.*?)"|(\w+)/
I didn't come up with this myself; I found it on the web from someone who
was trying to solve the same problem as me. Call it lazy, but if you've
been in the Windoz world and aren't exposed to regular expressions often, then
they (and the rules involved) are quite intimidating. To me, my problem seemed so
common (parsing a search string into tokens)... and someone mentioned trying to
use String.split(), which uses regexps, which is why I ended up here.
Thanks for the advice and help.
robert
-----Original Message-----
From: Daniel F. Savarese [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 4:05 PM
To: ORO Users List
Subject: Re: splitting a search string into tokens
In message [EMAIL PROTECTED], Robert Taylor
writes:
I need to parse the search string into tokens in the manner that search
engines would.
Lexical analysis (i.e., tokenization) and parsing are two separate activities.
Sometimes you can get away with combining the two, but you'll find you can
only do so much with split. Define a regular expression for each of
your tokens and consume the input matching against each in a specified
order. In your case, tokens appear to be either \s+ (i.e., the separator
which would be discarded), \S+, and "[^"]+?". You have to test for the last
token first to avoid misidentifying whitespace. It so happens that you can
manage this with split. You appear to have almost gotten there already with:
Here is what I've tried (but it doesn't cover escaping metacharacters
which might be in the search string):
/"(.*?)"|(\w+)/
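As a rough illustration of the approach described above (one pattern per token, with the quoted-phrase pattern tested before the plain non-whitespace run), here is a minimal sketch. It uses java.util.regex rather than ORO, and the class name SearchTokenizer and the method tokenize are hypothetical names chosen for the example. The whitespace separator is not matched explicitly; it is discarded because find() skips over anything that neither alternative matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ORO or of the original message.
class SearchTokenizer {
    // Alternatives are tried left to right, so the quoted phrase is
    // tested first and a bare run of non-whitespace only second.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]+)\"|(\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 1 holds the text inside the quotes; group 2 a bare word.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo \"bar baz\" qux"));
    }
}
```

With this ordering, a phrase such as "bar baz" comes back as one token with the quotes removed, while an unbalanced quote simply falls through to the second alternative and stays attached to its word.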
I don't understand where your search string and escaped metacharacters
enter the picture. If you need to escape metacharacters in a string,
use Perl5Compiler.quotemeta. I hope that helps.
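To show what escaping metacharacters buys you, here is a small sketch. It is an assumption-laden stand-in: it uses Pattern.quote from java.util.regex, which plays a similar role to the Perl5Compiler.quotemeta call named above, and the class QuoteDemo and method containsLiteral are hypothetical names for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern.quote is the java.util.regex
// counterpart, used here in place of ORO's Perl5Compiler.quotemeta.
class QuoteDemo {
    // Returns true if text contains term as a literal substring,
    // even when term contains regex metacharacters such as '+'.
    static boolean containsLiteral(String text, String term) {
        Pattern literal = Pattern.compile(Pattern.quote(term));
        return literal.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("price is 1+1", "1+1"));
        System.out.println(containsLiteral("price is 11", "1+1"));
    }
}
```

Without the quoting step, the term 1+1 would be read as "one or more 1s followed by 1" and would match the text 11, which is not what a literal search intends.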
daniel
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]