Re: Character Encoding...

The Editor Wed, 06 Jan 2010 06:07:27 -0800

I posted this under a new thread because it is an important
discussion--highlighting two character encoding solutions that are
better than our somewhat compromised solution of only encoding one
character (<). I'd like to hear feedback from others, before making a
final decision and stepping out one way or the other.

I've made a few notes below. Thanks Danny for clarifying the issues involved.

On Tue, Jan 5, 2010 at 1:04 PM, Danny <[email protected]> wrote:
> For an encoding system, there are generally two common strategies:
> 1 is to htmlspecialchars-encode all pages on save and to decode only
> when inserting code.* pages
> 2 is to save all pages as original and escape them on loading
> non-code.* pages, data, info, and so on.
>
> good for 1:
> -better performance
> -simpler code

The other advantage is enhanced security. If somehow the .htaccess
file gets corrupted or we forget to encode some part of a page before
returning it to the browser, this option provides a second line of
defense. This is the reason we currently have < escaped.

The performance might be improved with this method, but I don't think
we are talking about any kind of significant difference. BoltWire is
already reasonably fast.

However, I'm not impressed with the "simpler code" argument. To me it
seems we have terribly complex code trying to get everything to display
just right: preview, escapes, code pages, etc. And we are always
trying to avoid double encoding... Arggh... Which is why I lean toward
#2...
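
To make the double-encoding headache concrete, here is a minimal sketch. It is not BoltWire code; the variable names are made up for illustration, and it assumes a PHP version whose htmlspecialchars() accepts the fourth (double_encode) argument.

```php
<?php
// Illustration only, not taken from engine.php.
$raw = 'a < b';

// Encoding once is what we want for display.
$once = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Encoding the already-encoded text (say, save followed by preview)
// also escapes the & of the entity, so the reader sees "&lt;" on screen.
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

// Passing false as the fourth argument leaves existing entities alone.
$guarded = htmlspecialchars($once, ENT_QUOTES, 'UTF-8', false);

echo $once, "\n", $twice, "\n", $guarded, "\n";
```

The guard has its own cost: text where an author really did type an entity literally can no longer be told apart from text that was already encoded, which is part of why the bookkeeping gets so messy.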

> good for 2:
> - convenient for modifying the source files directly
> - less disk spent

Smaller filesize is probably trivial, but...

I edit the source files directly all the time. In fact, my goal was
to make it possible to simply drop text files into a BoltWire pages
folder and, apart from changing their names, have them instantly
available to the wiki. No longer escaping < would bring us closer to
that goal.

The complexity of trying to keep everything always encoded just right
is burdensome. I would love to strip out all those confusing lines of
code and see everything just fall into place!

As for security, we have troubles either way. With #2, we don't
have to worry about making sure everything that writes to a page is
properly encoded, because it gets encoded on the way out. I think
Apache is probably secure enough. And for other potential security
risks, I suspect we can put up simple but effective roadblocks.
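
A rough sketch of what "encoded on the way out" could look like. The two helper functions are hypothetical names invented for this example, not existing BoltWire functions, and the code-page flag is an assumption about how code.* pages would be told apart.

```php
<?php
// Hypothetical helpers illustrating option 2; not part of BoltWire.

// Save the text exactly as the author typed it.
function sketch_save_page($file, $text) {
    file_put_contents($file, $text);
}

// Escape only when the text is loaded for display; code pages come back raw.
function sketch_load_page($file, $isCodePage = false) {
    $text = file_get_contents($file);
    return $isCodePage ? $text : htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$file = tempnam(sys_get_temp_dir(), 'bw');
sketch_save_page($file, '<b>bold</b> & more');
echo sketch_load_page($file), "\n";
unlink($file);
```

With a single load function doing the escaping, nothing that writes to a page needs to know about encoding at all.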

For these reasons, keeping the source as simple as possible appears
more attractive to me. I think this could greatly simplify the core
code. That may change when we get around to implementing it, but I get
the feeling now that this will work very nicely.

> the bad is no matter 1 or 2, the backward compatibility would be
> terrible...

We can write a script to scan and fix the pages; we've done that
before. Easy enough. There may also be an easy workaround since we
would only be unescaping one character.
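
Something along these lines might do for the conversion. This is only a sketch: the function name is hypothetical, and it assumes all pages sit as plain files in one flat folder.

```php
<?php
// Hypothetical one-off conversion script; not an existing BoltWire tool.
// Replaces the stored &lt; with a literal < in every file of a folder
// and returns the number of files that were rewritten.
function sketch_unescape_pages($dir) {
    $changed = 0;
    foreach (glob($dir . '/*') as $file) {
        if (!is_file($file)) {
            continue; // skip subfolders
        }
        $old = file_get_contents($file);
        $new = str_replace('&lt;', '<', $old);
        if ($new !== $old) {
            file_put_contents($file, $new);
            $changed++;
        }
    }
    return $changed;
}
```

It should be run on a backup copy first. Note that it cannot tell a stored < apart from a literal &lt; that an author typed on purpose, so both would be converted.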

> I used to think about 2 because I didn't notice the choice of 1 and
> notice that data and info loading also needs escape, but now I think 1
> may be the better choice since we load often and save less.

Agreed, but I think the simplicity of the code (and easier-to-read
source files!) outweighs minor performance gains.

> Also, the current data/info loading is buggy since ">" are not escaped
> so <b>xxx</b> in data/info would not be parsed as markups.

Hmmm, you are right. This can be fixed by inserting the second line
below into BOLTvarCache (in engine.php):

                        $d = substr($d, strpos($d, "\n~data~\n") + 8);
                        $d = str_replace('&lt;', '<', $d);

Just another example of the kind of thing we could eliminate if we
modified our current approach to character encoding.

Cheers,
Dan

