I posted this under a new thread because it is an important discussion--highlighting two character encoding solutions that are better than our somewhat compromised solution of only encoding one character (<). I'd like to hear feedback from others, before making a final decision and stepping out one way or the other.
I've made a few notes below. Thanks Danny for clarifying the issues involved.

On Tue, Jan 5, 2010 at 1:04 PM, Danny <[email protected]> wrote:
> For encoding system, generally two common strategies:
> 1 is to do htmlspecialchar encode for all pages on save and to decode
> only on inserting code.* pages
> 2 is to save all pages as original and escape them on loading non-
> code* pages, datas, infos or so.
>
> good for 1:
> -better performance
> -simpler codes

The other advantage is enhanced security. If somehow the .htaccess file gets corrupted or we forget to encode some part of a page before returning it to the page, this option provides a second line of defense. This is the reason we currently have < escaped.

The performance might be improved with this method, but I don't think we are talking about any kind of significant difference. BoltWire is already reasonably fast.

However, I'm not impressed with the simpler codes argument. To me it seems we have terribly complex code trying to get everything always displaying just right. Preview, escapes, code pages, etc. And always trying to avoid double encoding... Arggh... Which is why I lean toward #2...

> good for 2:
> - convenient for modifying the source files directly
> - less disk spent

Smaller file size is probably trivial, but... I do extensive editing all the time directly of the source files. In fact, my goal was to make it possible to simply drop text files into a BoltWire pages folder and, apart from changing their names, have them instantly available to the wiki. No longer escaping < would increase this functionality.

The complexity of trying to keep everything always encoded just right is burdensome. I would love to strip out all those confusing lines of code and see everything just fall into place!

As for security, we have troubles either way. In this case, we don't have to worry about making sure everything that writes to a page is properly encoded, because it gets encoded on the way out.

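To make option 2 concrete, here is a minimal sketch of the idea, not BoltWire's actual code: pages are stored exactly as typed, and escaping happens once, at output time. The function names savePageRaw and renderPageText and the temp-file layout are hypothetical and exist only for this illustration; htmlspecialchars, file_put_contents and file_get_contents are standard PHP.

```php
<?php
// Hypothetical helpers illustrating option 2 (store raw, escape on output).
// These names are NOT part of BoltWire; they are assumptions for this sketch.

// Save the page exactly as the author typed it: no encoding on the way in.
function savePageRaw(string $file, string $text): void {
    file_put_contents($file, $text);
}

// Escape once, on the way out, for anything that is not a code.* page.
function renderPageText(string $file): string {
    $raw = file_get_contents($file);
    if ($raw === false) {
        return '';                  // unreadable page: show nothing
    }
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Small demonstration using a throwaway temp file.
$file = tempnam(sys_get_temp_dir(), 'bw');
savePageRaw($file, '<b>bold</b> & "quoted"');
echo renderPageText($file), "\n";
unlink($file);
```

With this shape the file on disk stays readable and hand-editable, and there is exactly one place where encoding happens, so double encoding cannot occur on save.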
I think Apache is probably secure enough. And for other potential security risks, I suspect we can put up simple but effective road blocks.

For these reasons keeping the source simplest appears more attractive to me. I think this could greatly simplify the core code. That may change when we get around to implementing it, but I get the feeling now that this will work very nicely.

> the bad is no matter 1 or 2, the backward compatibility would be
> terrible...

We can write a script to scan and fix the pages, we've done that before. Easy enough. There may also be an easy workaround since we would only be unescaping one character.

> I used to think about 2 because I didn't notice the choice of 1 and
> notice that data and info loading also needs escape, but now I think 1
> may be the better choice since we load often and save less.

Agreed, but I think the simplicity of the code (and easier to read source files!) outweighs minor performance gains.

> Also, the current data/info loading is buggy since ">" are not escaped
> so <b>xxx</b> in data/info would not be parsed as markups.

Hmmm, you are right. This can be fixed by inserting the second line below into BOLTvarCache (in engine.php):

    $d = substr($d, strpos($d, "\n~data~\n") + 8);
    $d = str_replace('<', '&lt;', $d);

Just another example of the kind of thing we could eliminate if we modified our current approach to character encoding.

Cheers,
Dan
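
The scan-and-fix script mentioned above could look roughly like the following under option 2. This is only a sketch under stated assumptions: the function name unescapePageFile is hypothetical, the pages folder is passed on the command line, and it assumes &lt; in stored pages only ever came from the on-save encoding of <. A page where someone deliberately typed the entity would be changed too, so it is worth running on a copy of the pages folder first.

```php
<?php
// Hypothetical one-off migration: turn stored "&lt;" back into "<".
// Assumes "<" was the only character encoded on save, as discussed above.

function unescapePageFile(string $file): bool {
    $text = file_get_contents($file);
    if ($text === false) {
        return false;               // unreadable file: leave it alone
    }
    $fixed = str_replace('&lt;', '<', $text);
    if ($fixed === $text) {
        return false;               // nothing to change
    }
    return file_put_contents($file, $fixed) !== false;
}

// Usage (assumed): php fix.php path/to/pages
$dir = $argv[1] ?? null;
if ($dir !== null && is_dir($dir)) {
    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $file) {
        if (is_file($file) && unescapePageFile($file)) {
            echo "fixed: $file\n";
        }
    }
}
```

Running it a second time changes nothing, since the function reports false whenever a file contains no &lt; to convert.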
