https://bugs.documentfoundation.org/show_bug.cgi?id=96413
--- Comment #20 from Mike Kaganski ---
(In reply to Andreas Säger from comment #0)

Just to expand on the last comment.

RFC 8089 [1] sect. 2 defines the 'file' URL as:

> file-URI       = file-scheme ":" file-hier-part
>
> file-hier-part = ( "//" auth-path )
>                / local-path
>
> auth-path      = [ file-auth ] path-absolute
>
> local-path     = path-absolute
>
> file-auth      = "localhost"
>                / host

with the explanation "importing the "host" and "path-absolute" rules from [RFC3986] (as updated by [RFC6874])". RFC 3986 defines them this way:

> host          = ...
> path-absolute = "/" [ segment-nz *( "/" segment ) ]
> segment       = *pchar
> segment-nz    = 1*pchar
> pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> pct-encoded   = "%" HEXDIG HEXDIG
> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>               / "*" / "+" / "," / ";" / "="

And finally, ALPHA, DIGIT, and HEXDIG are defined in RFC 2234 in an obvious way.

The scheme [1] applies both to Basic's ConvertToURL and to Python's uno.systemPathToFileUrl, as mentioned in comment 18, simply because both functions are defined to convert a system path to a file URL. The system path characters are converted to characters of the URL's 'file-hier-part'; more precisely, to 'path-absolute', because 'file-auth' is not applicable to system-path-to-URL conversion (the auth information is not available in the system path). Further, I omitted the 'host' part, because the issue did not consider "hosted" system paths, like UNC; leaving it out avoids additional complexity (which doesn't change anything, just clutters the text).

Now let us consider the issue character by character:

> Problem occurs with characters
> !
> $
> &
> '
> (
> )
> *
> +
> ,
> =

All the above are 'sub-delims', explicitly allowed in 'pchar', which makes up both 'segment' and 'segment-nz' of 'path-absolute'.

> /

This one is explicitly shown as the character delimiting segments in 'path-absolute'.
Both Linux and Windows filesystems use this character as the hierarchy delimiter, so whenever it appears in the source path, it gets "converted" to itself in the resulting URL.

> :
> @

These two are allowed explicitly in 'pchar'.

> ConvertToURL correctly encodes the following characters:
> <space>
> "
> {
> |
> }

These are not listed at all among the "very limited set" of characters from which URIs are built (RFC 3986 sect. 1.2.1; Appendix A), so they must be percent-encoded.

> \

This one gets converted to "/" on Windows, since it is a hierarchy separator there; on Linux, it behaves the same as the five above.

> %

This is the character used in 'pct-encoded'; its conversion is explicitly defined in RFC 3986 sect. 2.4.

> #
> ?
> [
> ]

These are 'gen-delims' that aren't explicitly mentioned as allowed in the 'path-absolute' components, so they must be percent-encoded as well.

> ;

And this one is interesting. It is part of 'sub-delims', so in theory it could stay as is. However, both INetURLObject [2] and osl_getFileURLFromSystemPath [3] (which are used in Basic and Python, respectively) percent-encode it. The most likely reason is that the previous version of the "Uniform Resource Identifiers (URI): Generic Syntax" standard, RFC 2396, didn't allow that character there.

[1] https://www.rfc-editor.org/rfc/rfc8089
[2] https://opengrok.libreoffice.org/xref/core/include/tools/urlobj.hxx?r=485300f9#179
[3] https://opengrok.libreoffice.org/xref/core/include/osl/file.h?r=0ce7c84c#1443
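
To make the character-by-character analysis easier to check, here is a minimal Python sketch of the RFC 3986 'path-absolute' rules quoted above. It is an illustration only, not the code behind ConvertToURL or uno.systemPathToFileUrl: the helper name encode_path and its encode_semicolon flag are hypothetical, and the sketch relies on the standard library's urllib.parse.quote rather than on INetURLObject or osl_getFileURLFromSystemPath. It also does not model the Windows-specific conversion of "\" to "/".

```python
from urllib.parse import quote

# Punctuation that RFC 3986 allows unescaped inside a path segment ('pchar').
UNRESERVED_PUNCT = "-._~"   # 'unreserved' besides ALPHA / DIGIT
SUB_DELIMS = "!$&'()*+,;="  # 'sub-delims'
PCHAR_EXTRA = ":@"          # allowed explicitly in 'pchar'


def encode_path(path: str, encode_semicolon: bool = False) -> str:
    """Percent-encode every character not allowed in 'path-absolute'.

    Hypothetical helper for illustration; "/" is kept as the segment
    delimiter, everything outside 'pchar' becomes %XX (UTF-8 octets).
    """
    safe = "/" + UNRESERVED_PUNCT + SUB_DELIMS + PCHAR_EXTRA
    if encode_semicolon:
        # Assumption: mimic the behaviour described above, where ";" is
        # percent-encoded although RFC 3986 would allow it in a segment.
        safe = safe.replace(";", "")
    return quote(path, safe=safe)


if __name__ == "__main__":
    samples = ("/tmp/a!$&'()*+,=b", '/tmp/a b"{|}', "/tmp/100%#?[]", "/tmp/a;b")
    for sample in samples:
        print(sample, "->", encode_path(sample, encode_semicolon=True))
```

With encode_semicolon left at its default, ";" stays unescaped, which is what a strict reading of RFC 3986 permits; passing encode_semicolon=True approximates the stricter result described for the two LibreOffice functions.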
