Re: [htdig-dev] "file name.html" -> "filename.html";(

Gilles Detillieux Thu, 14 Mar 2002 13:22:00 -0800

According to Jessica Biola:
> Is there a way to match spaces in a regex inside a
> url_rewrite_rules parameter, so that you could just
> do:
> 
> url_rewrite_rules: (.*)[:space:](.*) \1%20\2
> 
> (of course, you'd have to repeat this same rule
> multiple times to handle multiple spaces)  I tried the
> above rule and it didn't seem to work.  Characters
> inside the [brackets] were taken literally, and thus,
> the first s, p, a, c, or e were replaced with %20.
> 
> This may seem like a wimpy work-around, but it could
> be done without the need to modify any code
> internally, keeping htdig RFC2396 compliant at the
> same time.
> 
> So if you could help me with the regex I would
> appreciate it.


Interesting idea, but there are a few reasons it won't work:

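1) As you discovered, [:space:] on its own isn't treated as a character
class.  In POSIX regex syntax a class name is only recognized inside a
bracket expression, i.e. [[:space:]]; with single brackets it's just a
set of the literal characters ":", "s", "p", "a", "c" and "e".  Whether
[[:space:]] itself works may depend on which regex code ends up being
used, as not all C libraries implement it.  Even if yours does, see
point 3.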

2) You can't use just a space in the regular expression, either with
or without the brackets, because url_rewrite_rules is parsed as a
string list, not a quoted string list, so there's no way to embed a
literal space in your regular expression.

3) Even if you could get around the two problems above, it still wouldn't
work because the URL class doesn't do the rewriting until AFTER it's
parsed the URL, and so the spaces are already stripped out in accordance
with RFC2396.
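
The behaviour in point 1 can be checked outside of htdig with the C
library's POSIX regex functions.  The following is only a rough sketch:
first_match_offset() is a hypothetical helper, not anything in the htdig
source, and it assumes the regex code in use is the system's
regcomp()/regexec() with REG_EXTENDED.

```cpp
#include <regex.h>   // POSIX regcomp(), regexec(), regfree()
#include <string>

// Hypothetical helper (not part of htdig): returns the byte offset of the
// first match of `pattern` in `text`, or -1 if nothing matches or the
// pattern fails to compile.
int first_match_offset(const char *pattern, const std::string &text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    regmatch_t m;
    int offset = -1;
    if (regexec(&re, text.c_str(), 1, &m, 0) == 0)
        offset = static_cast<int>(m.rm_so);

    regfree(&re);
    return offset;
}
```

With "file name.html" as the text, "[:space:]" should match at offset 3
(the "e" in "file"), since it's just a set of literal characters, while
"[[:space:]]" should match at offset 4, the space itself.  None of this
gets around points 2 and 3, of course.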

By the way, any trick you'd use to make htdig handle spaces within URLs
would be a violation of RFC2396, regardless of whether it required code
changes or just config file changes.  The standard says spaces should
be stripped out.  The way most web browsers handle spaces within URLs is
also a violation of RFC2396.  The question is whether/how we get htdig
to do likewise.

The change I had suggested previously, which Joe Jah wrote into a patch,
mostly does things correctly.  Only one bit is missing.  All white space
characters other than the space itself are stripped out anywhere, and
the chop() call strips off trailing spaces, but there's nothing in that
patch to strip off leading spaces, which is what caused grief in Joe's
test of his patch.

What you could do is, in addition to Joe's patch, add the following
at the very start of URL::URL(char *ref, URL &parent)...

    while (*ref == ' ')
        ref++;

and this at the very start of URL::parse(char *u)...

    while (*u == ' ')
        u++;

before ref or u is assigned to the String "temp".
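
To illustrate the combined effect, here is a minimal stand-alone sketch
of the handling described above.  It is not the htdig code or Joe's
patch: normalize_ref() is a hypothetical name, and it uses std::string
rather than htdig's own String class, so take it as an approximation of
the intended behaviour only.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-alone helper, NOT htdig's actual URL code: mimics the
// white space handling described above on a plain std::string.
std::string normalize_ref(const char *ref)
{
    // Skip leading spaces first (the step missing from the patch).
    while (*ref == ' ')
        ref++;

    std::string temp;
    for (const char *p = ref; *p; p++) {
        // Drop tabs, newlines, etc. anywhere; keep embedded spaces.
        if (*p == ' ' || !std::isspace(static_cast<unsigned char>(*p)))
            temp += *p;
    }

    // Chop trailing spaces.  If the string is empty or all spaces,
    // find_last_not_of() returns npos, npos + 1 wraps to 0, and
    // erase(0) leaves an empty string.
    temp.erase(temp.find_last_not_of(' ') + 1);
    return temp;
}
```

For example, normalize_ref("  file name.html \n") should give
"file name.html", with the embedded space kept and everything around
it removed.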

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
