[Libreoffice-bugs] [Bug 96413] Basic ConvertToURL fails to URL encode many characters (see comment 18)

bugzilla-daemon Mon, 17 Apr 2023 04:53:59 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=96413


--- Comment #20 from Mike Kaganski <[email protected]> ---
(In reply to Andreas Säger from comment #0)

Just to expand on the last comment.

RFC 8089 [1] sect. 2 defines 'file' URL as:

> file-URI       = file-scheme ":" file-hier-part
> 
> file-hier-part = ( "//" auth-path )
>                / local-path
> 
> auth-path      = [ file-auth ] path-absolute
> 
> local-path     = path-absolute
> 
> file-auth      = "localhost"
>                / host

with the explanation "importing the "host" and "path-absolute" rules from
[RFC3986] (as updated by [RFC6874])".

RFC 3986 defines them this way:

> host          = ...
> path-absolute = "/" [ segment-nz *( "/" segment ) ]
> segment       = *pchar
> segment-nz    = 1*pchar
> pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> pct-encoded   = "%" HEXDIG HEXDIG
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="

And finally, ALPHA, DIGIT, and HEXDIG are defined in RFC 2234 in an obvious
way.
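
To make the grammar above easier to experiment with, here is a rough Python
sketch that transcribes the quoted RFC 3986 rules into a regular expression.
The names (PCHAR, PATH_ABSOLUTE, is_path_absolute) are made up for this
illustration; this is not code from LibreOffice, only a reading of the ABNF.

```python
import re

# Character classes transcribed from the RFC 3986 rules quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"
SEGMENT = rf"{PCHAR}*"      # segment    = *pchar
SEGMENT_NZ = rf"{PCHAR}+"   # segment-nz = 1*pchar

# path-absolute = "/" [ segment-nz *( "/" segment ) ]
PATH_ABSOLUTE = re.compile(rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?")


def is_path_absolute(path: str) -> bool:
    """Return True if the whole string matches 'path-absolute'."""
    return PATH_ABSOLUTE.fullmatch(path) is not None
```

With this sketch, "/home/user/a!b(c).txt" and "/a%20b" match, while "/a b"
and "/a#b" do not.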

The scheme [1] is used by both Basic's ConvertToURL and Python's
uno.systemPathToFileUrl (mentioned in comment 18), simply because both
functions are defined to convert a system path to a file URL.

The system path characters are converted to characters of the URL's
'file-hier-part'; more precisely, to 'path-absolute', because 'file-auth' is
not applicable to system-path-to-URL conversion (the auth information is not
available in the system path). Further, I omitted the 'host' part, because the
issue did not consider "hosted" system paths, like UNC. Leaving it out avoids
additional complexity that would not change anything and would only clutter
the text.

Now let us consider the issue character by character:

> Problem occurs with characters
> !
> $
> &
> '
> (
> )
> *
> +
> ,
> =

All the above are 'sub-delims', explicitly allowed in 'pchar', which
constitutes both 'segment' and 'segment-nz' of 'path-absolute'.

> /

This one is explicitly shown as the character delimiting segments in
'path-absolute'. Both Linux and Windows filesystems use this character as a
hierarchy delimiter, so whenever it appears in the source path, it gets
"converted" to itself in the resulting URL.

> :
> @

These two are allowed explicitly in 'pchar'.

> ConvertToURL correctly encodes the following characters:
> <space> 
> "
> {
> |
> }

These are not listed at all among the "very limited set" of characters from
which URIs are composed (RFC 3986 sect. 1.2.1; Appendix A), so they must be
percent-encoded.

> \

This one gets converted to "/" on Windows, since it's a hierarchy separator
there; on Linux, it behaves the same as the five above.

> %

This is the character used in 'pct-encoded'; and its conversion is explicitly
defined in RFC 3986 sect. 2.4.
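
As an illustration of the rules discussed so far, here is a minimal Python
sketch of a segment encoder that keeps every character 'pchar' allows
literally and percent-encodes the UTF-8 bytes of everything else, "%"
included. The function name encode_segment is hypothetical, and this is not
how INetURLObject or osl_getFileURLFromSystemPath are implemented; it only
follows the RFC 3986 grammar quoted earlier.

```python
import string

# Characters that 'pchar' permits literally: unreserved, sub-delims, ":" and "@".
# "%" is deliberately absent: in a URL it only introduces 'pct-encoded'.
PCHAR_LITERAL = set(
    string.ascii_letters + string.digits + "-._~" + "!$&'()*+,;=" + ":@"
)


def encode_segment(segment: str) -> str:
    """Percent-encode one path segment (hypothetical helper, RFC 3986 rules only)."""
    out = []
    for ch in segment:
        if ch in PCHAR_LITERAL:
            out.append(ch)
        else:
            # Encode each UTF-8 byte as "%" followed by two uppercase hex digits.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions, encode_segment("100% (a b)") gives
"100%25%20(a%20b)": the parentheses stay, while "%" and the spaces are
encoded.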

> #
> ?
> [
> ]

These are 'gen-delims' that aren't explicitly mentioned as allowed in the
'path-absolute' components, so they have to be percent-encoded there.
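
The per-character reasoning above can be summarized in a small Python sketch.
The function classify and its category labels are invented for this
illustration; they only restate which RFC 3986 rule each character falls
under, and the function expects a single character.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
GEN_DELIMS = set(":/?#[]@")


def classify(ch: str) -> str:
    """Name the RFC 3986 rule that covers a single character (labels are hypothetical)."""
    if ch in UNRESERVED:
        return "unreserved"          # allowed in pchar
    if ch in SUB_DELIMS:
        return "sub-delims"          # allowed in pchar
    if ch in set(":@"):
        return "pchar-extra"         # gen-delims that pchar lists by name
    if ch == "/":
        return "segment-separator"   # delimits segments in path-absolute
    if ch in GEN_DELIMS:
        return "gen-delims"          # "?", "#", "[", "]": not allowed in a segment
    if ch == "%":
        return "pct-marker"          # only valid as the start of pct-encoded
    return "outside-uri-set"         # space, '"', "{", "|", "}", "\\", ...
```

Only "unreserved", "sub-delims", "pchar-extra" and "segment-separator" may
appear unencoded in 'path-absolute'.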

> ;

And this one is interesting. It is part of 'sub-delims', so in theory, it
could stay as is. However, both INetURLObject [2] and
osl_getFileURLFromSystemPath [3] (which are used in Basic and Python,
respectively) percent-encode it. The most likely reason is that the previous
version of the "Uniform Resource Identifiers (URI): Generic Syntax" standard,
RFC 2396, didn't allow that character there.
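
For comparison outside LibreOffice, the difference for ";" can be reproduced
with Python's standard urllib.parse.quote by varying its 'safe' argument. The
two constants and the helper encode_path below are hypothetical names for
this sketch; the stricter set only mimics the ";" handling described above
and is not claimed to match either LibreOffice function otherwise.

```python
from urllib.parse import quote

# "/" plus everything 'pchar' allows literally beyond the unreserved set
# (quote() never encodes letters, digits, "-", ".", "_" and "~").
RFC3986_PATH_SAFE = "/!$&'()*+,;=:@"

# Hypothetical stricter variant: the same set without ";".
STRICT_PATH_SAFE = RFC3986_PATH_SAFE.replace(";", "")


def encode_path(path: str, safe: str = RFC3986_PATH_SAFE) -> str:
    """Percent-encode a "/"-separated path, leaving the 'safe' characters alone."""
    return quote(path, safe=safe)
```

With the RFC 3986 set, encode_path("/tmp/a;b (1).txt") yields
"/tmp/a;b%20(1).txt"; passing STRICT_PATH_SAFE yields
"/tmp/a%3Bb%20(1).txt" instead.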

[1] https://www.rfc-editor.org/rfc/rfc8089
[2]
https://opengrok.libreoffice.org/xref/core/include/tools/urlobj.hxx?r=485300f9#179
[3]
https://opengrok.libreoffice.org/xref/core/include/osl/file.h?r=0ce7c84c#1443

-- 
You are receiving this mail because:
You are the assignee for the bug.
