According to Christopher Murtagh:
> On Thu, 2004-02-19 at 12:29, Gilles Detillieux wrote:
> > Yes, unless you configure 3.2.0b5 with --disable-shared, then
> > htlib/HtRegex.cc will wind up in the libht.so shared library, not right
> > in the htsearch binary.  So, you'd need to install the rebuilt shared
> > library, and possibly kill all running ht://Dig programs, before htsearch
> > would use the modified HtRegex code.
> 
>  Ok, I tried that with the latest snapshot (3.2.0b5-20040215) and had
> the same problem. Any restrict value that contained a '/' in it returned
> no results.
> 
>  However all is not lost. I now know that it is not an OS problem.
> I grabbed htdig-3.2.0b5.tar.gz from /files/ and performed the exact same
> config and make install. Now my restricts work properly, and I can pass
> values with '/' in them.
> 
>  So, this is definitely a bug that was introduced recently.

Bingo!  Thanks for the sleuthing, Chris.  Here's the problem:

Sun Dec 21 2003 Lachlan Andrew <[EMAIL PROTECTED]>

    * htsearch/htsearch.cc:
      Improve handling of restrict/exclude URLs with spaces or encoded chars

This fix was based on a poorly conceived "fix" offered by Jean-Sebastien
Morisset on Nov. 17, so he could more easily deal with restrict
patterns involving file names with embedded spaces.  I had replied to
his suggestion as follows...

   Spaces in file names will cause no end of grief at all sorts of levels,
   and if you have any control over the matter, it's always best to avoid
   them altogether.  They cause problems with
   some browsers, many HTML code generators, some proxy servers (even
   when properly encoded), and of course in situations like the one you
   just discovered.
   
   If you must stick with spaces in names, then you have to be very clever
   (or tricky) to make sure they stay properly encoded up to the point where
   they're needed.  In the case of a CGI input parameter, %xx hex encoding
   is decoded almost right away, as you've discovered, so it won't match
   the URL, which is still encoded; the space is also taken as a separator.
   Have you tried an extra level of encoding,
   i.e. encoding %20 as %2520?  In that way, the %25 in the CGI input
   parameter should decode as %, so you should be left with %20 in the
   "restrict" URL pattern.
   
   If that works, it would be a much better and more logical solution
   than hacking the code to try to preserve embedded spaces as spaces all
   along the way, as I can imagine this causing all sorts of other things
   to break in the future.

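To make the double-encoding trick concrete, here's a minimal sketch of
my own (not from the ht://Dig sources) of a single pass of %xx decoding,
roughly what happens to a CGI input parameter, showing that %2520 comes
out the other end as %20:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // One pass of %xx hex decoding, roughly what a CGI input parameter
    // undergoes.  Illustrative only -- not the actual ht://Dig decoder.
    std::string decode_once(const std::string &s)
    {
        std::string out;
        for (size_t i = 0; i < s.size(); i++) {
            if (s[i] == '%' && i + 2 < s.size()) {
                char hex[3] = { s[i+1], s[i+2], '\0' };
                out += (char) strtol(hex, 0, 16);
                i += 2;
            } else {
                out += s[i];
            }
        }
        return out;
    }

    int main()
    {
        // "%2520" decodes to "%20", so after the CGI decode the restrict
        // pattern still matches the encoded space in the stored URL.
        printf("%s\n", decode_once("/docs/my%2520file.html").c_str());
        // prints: /docs/my%20file.html
        return 0;
    }

Two levels of decoding are in play: the CGI input gets one pass, which
strips the %25 down to a bare %, leaving %20 intact to match against the
still-encoded URL in the database.
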
I never did hear back from Jean-Sebastien after that, but a month later
Lachlan decided to implement Jean-Sebastien's suggestion in the 3.2 CVS
code.
In fairness to Lachlan, he did ask for the opinion of other developers,
and no one responded at the time.  I intended to reiterate my objections,
but it was a busy time, just before the Christmas break, and after getting
back in the new year I got busy again and forgot.  In retrospect, maybe
requests for comment in late December aren't such a good idea.  ;-)

Unfortunately (or perhaps fortunately, as it brought the problem to
light more quickly), Lachlan didn't include Jean-Sebastien's suggested
valid character list as the optional 2nd argument to encodeURL(), so
the slashes in restrict/exclude patterns now get encoded to %2F.

I'm not one to say "I told you so," but I do think warnings about things
breaking in the future deserve some attention, especially when it comes
to adding features that, in my opinion, serve no purpose other than to
"save the user from himself."  If users are going to ignore the advice
of avoiding spaces and other reserved characters in web page file names,
and persist in using them, I think they deserve to have to deal with
the complications this causes.  Fixing the code to make things easy for
them, and difficult for everyone else, is not a particularly wise idea,
especially when they're given what should be a usable workaround and
don't even bother to give any feedback on this.

My recommendation is that we back out this feature until someone
can properly assess all the repercussions of it, and find a way of
implementing it that avoids all the pitfalls.  I think the slash is just
the visible tip of the iceberg, and that there are potential problems
lurking with other characters that get encoded.  The decision to encode
or not often depends on context, where the character appears in the URL,
and what the character is supposed to represent in the end.  It's not a
policy that's easily coded in a few lines of C++.  It seems to me that
without this fix, there is still a way of giving any pattern you need,
as long as you're careful to take into account the level of decoding
the CGI parameter will undergo.  With the patch, however, there are now
certain patterns that you simply cannot specify, because the forced
encoding will not necessarily match the encoding the actual URLs
undergo.  I just think it's a bad idea, plain and simple.

Chris, if you want a quick fix, I think all you need to do is remove the
two encodeURL() calls in htsearch/htsearch.cc, and all should be well again.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

