Hi,
I'm encountering problems when attempting to retrieve pages from websites
over which I have no control and about which I can only learn from the
response headers (which may not be providing the correct information).
For example, I'm attempting to retrieve pages regardless of their encoding.
xdmp:http-get throws an error if the remote resource is not UTF-8 encoded:
let $http-get-options :=
  <options xmlns="xdmp:http">
    <verify-cert>false</verify-cert>
  </options>
return xdmp:http-get(
  "http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
  $http-get-options)
=> XDMP-DOCUTF8SEQ:
xdmp:http-get("http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
<options xmlns="xdmp:http"><verify-cert>false</verify-cert><repair
xmlns="xdmp:document-get...</options>) -- Invalid UTF-8 escape sequence at
http://www.elmoudjahid.com/fr/actualites/64076/?comopen line 299 --
document is not UTF-8 encoded
If I add the option <encoding
xmlns="xdmp:document-get">ISO-8859-1</encoding>, the content is retrieved
as expected. I could inspect the Content-Type header, which *may* contain
encoding information (or not, or the information may be wrong). I could
also brute-force a number of probable encodings via nested try/catch
constructs, but I'd hope there is a more intelligent approach?
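
To make the brute-force idea concrete, here is a rough sketch rather than
a finished solution. It assumes the body can be fetched as binary (the
<format> option in the xdmp:document-get namespace), that it comes back as
a document node with a binary() child, and that xdmp:binary-decode raises
an error for bytes that do not fit the given encoding. local:charset,
local:decode and local:get-text are hypothetical names, and the candidate
list is only a guess.

```xquery
xquery version "1.0-ml";

declare namespace http = "xdmp:http";

(: Hypothetical helper: extract the charset parameter from a
   Content-Type header value, or return () if there is none. :)
declare function local:charset($content-type as xs:string?) as xs:string?
{
  if (fn:matches($content-type, "charset\s*=", "i"))
  then fn:replace($content-type,
         '^.*charset\s*=\s*"?([^";\s]+).*$', "$1", "i")
  else ()
};

(: Try each candidate encoding in turn; the first one that decodes
   without raising an error wins. :)
declare function local:decode($body as node(), $encodings as xs:string*)
  as xs:string
{
  if (fn:empty($encodings))
  then fn:error(xs:QName("local:UNDECODABLE"),
         "no candidate encoding fits")
  else
    try { xdmp:binary-decode($body, $encodings[1]) }
    catch ($e) { local:decode($body, fn:subsequence($encodings, 2)) }
};

declare function local:get-text($uri as xs:string) as xs:string
{
  let $options :=
    <options xmlns="xdmp:http">
      <verify-cert>false</verify-cert>
      <format xmlns="xdmp:document-get">binary</format>
    </options>
  let $response := xdmp:http-get($uri, $options)
  (: Header claim first, then UTF-8; ISO-8859-1 accepts any byte
     sequence, so it only makes sense as the last resort. :)
  let $candidates := (
    local:charset($response[1]/http:headers/http:content-type[1]),
    "UTF-8", "ISO-8859-1")
  (: Assumption: the body is a document node wrapping a binary node. :)
  return local:decode($response[2]/binary(), $candidates)
};

local:get-text("http://www.elmoudjahid.com/fr/actualites/64076/?comopen")
```

This returns a string, so the result would still need parsing or tidying
afterwards. As far as I know, the encoding option of xdmp:document-get is
also documented to accept the value auto, which asks the server to guess
the encoding; that may be worth trying before anything hand-rolled.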
Another use case is the retrieval of pages via HTTPS where the certificate
is no longer valid. As you can see from the example above, I'm setting the
<verify-cert> option to false by default, because I really don't care
whether the webmaster looks after their certificates (should I?).
Then there are the 30x responses, which may contain a Location header
that in turn leads to further redirects before the final resource
location is reached.
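
For illustration, resolving such redirects recursively might look roughly
like the sketch below. It assumes that xdmp:http-get hands back the 30x
response itself instead of following it, and that the Location value can
be resolved against the requested URI. local:next-uri, local:follow and
the limit of 5 hops are made up for the example.

```xquery
xquery version "1.0-ml";

declare namespace http = "xdmp:http";

(: Hypothetical helper: the URI to try next if the response head is a
   redirect carrying a Location header, otherwise (). :)
declare function local:next-uri(
  $head as element(http:response), $base as xs:string) as xs:string?
{
  let $location := $head/http:headers/http:location[1]/fn:string()
  return
    if (xs:integer($head/http:code) = (301, 302, 303, 307, 308)
        and fn:exists($location))
    then fn:string(fn:resolve-uri($location, $base))
    else ()
};

(: Follow redirects until a non-redirect response arrives or the hop
   budget is used up; the last response received is returned as-is. :)
declare function local:follow(
  $uri as xs:string, $options as element(http:options),
  $max-hops as xs:integer) as item()+
{
  let $response := xdmp:http-get($uri, $options)
  let $next := local:next-uri($response[1], $uri)
  return
    if (fn:exists($next) and $max-hops gt 0)
    then local:follow($next, $options, $max-hops - 1)
    else $response
};

local:follow(
  "http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
  <options xmlns="xdmp:http"><verify-cert>false</verify-cert></options>,
  5)
```

The hop limit is there so that a redirect loop on the remote side cannot
recurse forever.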
As you can see (and are probably aware), there are lots of parameters that
may cause xdmp:http-get to not return the expected result. Does someone
have a wrapper function up their sleeve that they don't mind sharing? Or
could you point me to an algorithm or an implementation (or at least to a
complete list of the above issues)? I've written my own recursive function
to resolve 30x responses, but it's just one small piece among many others.
cheers,
Jakob.
_______________________________________________
General mailing list
http://developer.marklogic.com/mailman/listinfo/general