Re: html entities in page names

The Editor Wed, 21 Oct 2009 16:51:36 -0700

Sorry, my last post was to the wrong thread. I'll move it over and respond here:

On Wed, Oct 21, 2009 at 12:47 PM, Hans <[email protected]> wrote:
>
>> Note: you can already enter any of those special chars directly into
>> your browser and/or a link and they work fine, thanks to UTF.
>
> Yes, I said so. It is not as if we are talking about illegal characters.

I know. I was just pointing out that we are not really discriminating
against Greek letter users. All languages (including Chinese and
Greek) are treated exactly the same. All we are really doing is saying
you can't create a page name using html entities. You have to use the
actual symbol.

While it is not so convenient to enter the html entities on a page,
save it, then cut and paste the symbols into your url bar or page name
input field, it is easy enough to do. So we also have an easy
workaround in place.

>> For instance I just created a page like this:  test.Ξ⇔—© (Greek
>> letter, mapping mark, dash, and symbol) and it all works fine. The
>> only thing we are talking about are special characters (mostly
>> punctuation) that tend to produce problems in the core.
>
> No. I don't understand why you insist we are talking about "mostly
> punctuation" characters. This is not true. I gave links to pages with
> lists of HTML 4 entities. And we are not talking about characters
> which "produce problems in the core", I think. As page name characters
> they are url encoded. And characters like <>"# and a few more are not
> allowed and filtered out already.

The only characters you can't use are those in
$BOLTutfEscapeChars, which are all punctuation. Everything else gets
url encoded and is allowed. I can put anything in a url or link except
those chars; any other character simply gets url encoded and passes the
filter. So while I can't use &copy; there is nothing stopping me from
using ©.
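
To make the difference concrete, here is a minimal sketch in Python (not BoltWire's own PHP code) of what url encoding does to the literal symbol versus the entity text. The function name is made up for illustration; the only assumption is that page names are percent-encoded as UTF-8.

```python
from urllib.parse import quote, unquote

def encode_page_name(name):
    # Percent-encode a page name as UTF-8 (illustrative helper, not a BoltWire function).
    return quote(name, safe="")

# The literal symbol becomes two harmless percent-encoded bytes.
symbol = encode_page_name("©")

# The entity text contains "&" and ";", which are themselves encoded,
# so it never turns into the symbol on its own.
entity = encode_page_name("&copy;")

print(symbol)  # %C2%A9
print(entity)  # %26copy%3B
```

Decoding either string just gives back what went in, so the entity would stay an entity unless something decodes it explicitly.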

The real question seems to be more whether or not we want to allow
direct submission of html entities into page names/address bars and
how we should handle those. Or to put it differently, it's really not
about whether we can use special chars (we can), it's just whether or
not we want to allow another mechanism for introducing them, and
perhaps expanding the possibilities to somehow include the
utfEscapeChars as well.

>> Of course you cannot currently enter htmlentities for these chars and
>> have it work. You have to enter the actual symbol. So that might be
>> something worth pursuing. I'm just not convinced--as I much prefer
>> what you have go in come out. Not be automatically translated more
>> than necessary. I'll try and work on this some later today, but I have
>> a busy schedule today...
>
> Okay, I try to convince you: I appreciate the principle of "what goes
> in comes out", but although we can enter symbols covered as html
> entities directly, a) it may not always be convenient, and b) we
> cannot use html entities as code in page names, as the ampersand is  a
> special HTML character, and used as argument separator in urls.
>
> So we have the situation where we can enter HTML entities in the page
> content, and have these saved as they are, as code, and displayed
> decoded in a normal page, but we cannot do or expect to do the same
> for page names. Instead we can agree that HTML entities get translated
> to url % codes, and that way they get displayed decoded, as symbols
> etc. in the address bar as well as in page lists, messages etc.

Ok, took me a minute to figure out what you are saying, but it seems
to be that the way we handle page names is not the same as the way we
handle page content. That clicks with me. Of course with page content
we retain the html entity in the source. There is no original source
for a page name--so it is different in at least some regards.

Here is another example of a difference. Suppose we fix BoltWire so it
can take a page name like page.a&lt;b, change it to page.a<b and then
urlencode it and get it to work (hopefully without opening any
unanticipated security vulnerabilities somewhere in the process, of
course). I still couldn't put page.a&lt;b in the browser address
bar to create a new page, or go to it, because the browser will
interpret that as page.a. And that seems to be something hardcoded in
the browser...
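
A rough sketch of why the page name gets cut off, in Python rather than BoltWire's PHP. It assumes the page name travels in a query parameter called p and uses a placeholder host; both are assumptions for illustration. The ampersand is the argument separator, so everything after it is treated as a separate argument.

```python
from urllib.parse import urlsplit, parse_qs

def requested_page(url, param="p"):
    # Return the page name as a server would see it.
    # The parameter name "p" is an assumption for this sketch.
    query = urlsplit(url).query
    values = parse_qs(query, keep_blank_values=True)
    return values[param][0]

# Typed raw into the address bar: "&" ends the p argument early.
raw = requested_page("http://example.com/index.php?p=page.a&lt;b")

# Percent-encoding the "&" and ";" keeps the whole string in one argument.
encoded = requested_page("http://example.com/index.php?p=page.a%26lt%3Bb")

print(raw)      # page.a
print(encoded)  # page.a&lt;b
```

So the entity only survives in a url if its ampersand is already percent-encoded, which is not something a person typing into the address bar would normally do.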

In other words, we have an inconsistency in that you can enter
something in a create pagename field, but not that same string in the
url bar. Don't like that. Of course the flip side is also true: I can
put [[page.&copy;]] and [[page.©]] on a page and both are identical.
But there is a huge difference between page.© and page.&copy; when it
comes to my pagename input field. So we're inconsistent there... Which
is your point of course.

If the only problem is the input fields (I don't think we can solve
the address bar issue), doesn't it make sense to add a single line
(maybe in BOLTXtarget, or BOLTpageshortcuts) that automatically
decodes any html entities in a pagename--before it is filtered? So it
would be essentially as if you had entered the correct characters?
This way we still block our dozen or so problem characters, but not
any other characters... And no worries about any other changes in the
core code...  What do you think of this idea?
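
Roughly what I have in mind, sketched in Python instead of the actual BoltWire PHP. The function name and the blocked set below are hypothetical stand-ins (the real list lives in $BOLTutfEscapeChars and is not reproduced here). The point is only the order: decode entities first, then apply the existing filter.

```python
import html

# Hypothetical stand-in for the blocked punctuation; NOT the real
# contents of $BOLTutfEscapeChars.
BLOCKED_CHARS = set('<>"#&')

def normalize_page_name(raw):
    # Step 1: turn any html entities into the characters they stand for.
    decoded = html.unescape(raw)
    # Step 2: run the usual filter, so a decoded "<" is still stripped.
    return "".join(ch for ch in decoded if ch not in BLOCKED_CHARS)

print(normalize_page_name("page.&copy;"))  # page.©
print(normalize_page_name("page.a&lt;b"))  # page.ab
```

With this ordering, page.&copy; and page.© end up as the same page name, and an entity cannot be used to smuggle a blocked character past the filter.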

In summary,
1) I am concerned about possible security vulnerabilities by allowing
risky chars in page names.
2) I am worried about possible bugs, such as special chars being
entered that have pageshortcut meanings, and confusion with GET
variables.
3) I am concerned that certain pages could be created via a form but
not via the url bar. I realize, however, that there's probably
nothing we can do about the url bar either way.

I can live with the fact that the html entity I type in comes back out
differently as a page name; after all, that's what happens with page
content. And if an html entity is entered, it was likely intentional.

I don't sense a great burden to allow the entry of punctuation into
page names, and feel we have adequate workarounds for virtually any
other letter or symbol. But I do understand the issue is really
whether or not we can inject these other symbols via html entities
into page names, rather than requiring them to be entered directly as
symbols. And that in some cases the latter might be inconvenient.

I'll try your fixes tomorrow, and maybe take a couple stabs at my idea
and see what comes out. Thanks for challenging my thinking as usual to
explore all the possibilities of BoltWire...

Cheers,
Dan

