As I understand it, the main reason to check certificates is to avoid
man-in-the-middle attacks. Do you care if a hostile party intercepts your
request and returns its own response instead of the actual site response?

In the past when I've needed a really robust spider, I've used existing 
software: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers lists a 
few. In theory it's possible to do all that in XQuery, but I'd rather not 
reinvent the wheel.

For XDMP-DOCUTF8SEQ specifically, per 
https://docs.marklogic.com/xdmp:document-get "An automatic encoding detector 
will be used if the value auto is specified." So the usual advice would be to 
add:

    <encoding xmlns="xdmp:document-get">auto</encoding>
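
Spelled out as a full call, that might look like the sketch below. It reuses
the URI and the verify-cert option from your example; only the encoding
element is new.

```xquery
xquery version "1.0-ml";

(: Sketch: ask xdmp:http-get to auto-detect the response encoding. :)
xdmp:http-get(
  "http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
  <options xmlns="xdmp:http">
    <verify-cert>false</verify-cert>
    <encoding xmlns="xdmp:document-get">auto</encoding>
  </options>)
```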

However, the request still throws XDMP-DOCUTF8SEQ even with encoding=auto.
Possibly there's a bug there, or perhaps this site lies about its encoding:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

We could fetch the document with format=binary, but
xdmp:encoding-language-detect still thinks it's UTF-8. The page also claims to
be XHTML and yet doesn't seem to be well-formed, but that's relatively easy to
handle compared to the encoding.
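
For reference, the binary fetch and detection step can be sketched roughly
like this, using the same options as the sample further down. As I understand
it the detector returns a sequence of guesses, and the sample simply takes the
first one.

```xquery
xquery version "1.0-ml";

let $response := xdmp:http-get(
  "http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
  <options xmlns="xdmp:http">
    <verify-cert>false</verify-cert>
    <format xmlns="xdmp:document-get">binary</format>
  </options>)
(: $response[1] is the response metadata, $response[2] the binary body. :)
return xdmp:encoding-language-detect($response[2])
```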

I'd try to avoid fetching the document multiple times, because network wait 
will be the largest bottleneck in most cases. So I'd always fetch once, as 
binary, and then have a sequence of encodings to try. Some of this turns out to 
be ugly code, and you're rewriting stuff that xdmp:http-get normally does for 
you. But you only have to write it once.

The sample below might get you started. It seems to work ok with MarkLogic 
7.0-3. Obvious extensions include an option to repair XML, and an option to try 
xdmp:tidy if the body is text/html. As you run into other errors you could add 
handlers to the appropriate switch expressions.
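
A rough sketch of those two extensions, as a starting point only. The helper
name local:parse-lenient and the 'html' type are my assumptions, not part of
the sample: local:content-type would need a text/html branch to produce
'html'.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: parse already-decoded text leniently. :)
declare function local:parse-lenient(
  $text as xs:string,
  $type as xs:string)
as node()+
{
  switch($type)
  (: xdmp:tidy returns a status node, then the cleaned document. :)
  case 'html' return xdmp:tidy($text)[2]
  (: Otherwise ask the XML parser to repair what it can. :)
  default return xdmp:unquote($text, (), 'repair-full')
};

local:parse-lenient('<p>unclosed<br>paragraph', 'html')
```

Something like this could replace the plain xdmp:unquote call in
local:http-get-body, once the content-type mapping knows about HTML.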

It might also be useful to return a new metadata element that explains what 
happened during processing, sort of like the xdmp:http-get response element.
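
One possible shape for that element, purely as an illustration. The function
name and the element names here are invented for the sketch, and wiring it
into the recursion is left out.

```xquery
xquery version "1.0-ml";

(: Hypothetical report: which encodings were tried, and which one worked. :)
declare function local:decode-report(
  $tried as xs:string*,
  $used as xs:string?)
as element(report)
{
  <report>
    { $tried ! <tried-encoding>{ . }</tried-encoding> }
    { $used ! <used-encoding>{ . }</used-encoding> }
  </report>
};

local:decode-report(('UTF-8', 'ISO-8859-1'), 'ISO-8859-1')
```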

If you want to integrate this with your redirect handling code and start a 
github repo, please do.

-- Mike

declare namespace xhttp="xdmp:http" ;
declare namespace xeld="xdmp:encoding-language-detect" ;

declare function local:http-get-encoding(
  $encoding as xs:string,
  $n as node())
as xs:string
{
  switch($encoding)
  case 'auto' return xdmp:encoding-language-detect($n)[1]/xeld:encoding
  default return $encoding
};

declare function local:http-get-body(
  $body as document-node()?,
  $type as xs:string,
  $encodings as xs:string*)
as node()*
{
  if (empty($body)) then ()
  else if (empty($encodings) or $type eq 'binary') then $body
  else
  let $encoding := local:http-get-encoding($encodings[1], $body)
  return try {
    xdmp:binary-decode($body, $encoding) ! (
      switch($type)
      case 'text' return document { text { . } }
      default return try { xdmp:unquote(.) } catch($ex) {
        switch($ex/error:code)
        (: Bad XML. Extend as needed. :)
        case 'XDMP-DOCNOENDTAG' return document { text { . } }
        default return xdmp:rethrow() }) }
    catch ($ex) {
      switch($ex/error:code)
      (: Decoding errors. Extend as needed. :)
      case 'XDMP-DOCUTF8SEQ' return local:http-get-body(
        $body, $type, subsequence($encodings, 2))
      default return xdmp:rethrow() }
};

declare function local:content-type(
  $content-type as xs:string?)
as xs:string
{
  (: Figure out the node type from the content type.
   : Extend as needed.
   :)
  if (contains($content-type, 'text/xml')) then 'xml'
  else if (contains($content-type, 'text/')) then 'text'
  else 'binary'
};

declare function local:http-get(
  $uri as xs:string,
  $encodings as xs:string*)
as node()+
{
  (: Binary is safe for any encoding. :)
  let $response := xdmp:http-get(
    $uri,
    <options xmlns="xdmp:http">
      <verify-cert>false</verify-cert>
      <format xmlns="xdmp:document-get">binary</format>
    </options>)
  let $meta := $response[1]
  return (
    $meta,
    local:http-get-body(
      subsequence($response, 2),
      local:content-type($meta/xhttp:headers/xhttp:content-type),
      $encodings))
};

declare function local:http-get(
  $uri as xs:string)
as node()+
{
  (: Extend the list of fallback encodings as needed. :)
  local:http-get($uri, ('auto', 'ISO-8859-1'))
};

local:http-get(
  "http://www.elmoudjahid.com/fr/actualites/64076/?comopen")

On 20 Aug 2014, at 13:03 , Jakob Fix <[email protected]> wrote:

> Hi,
> 
> I'm encountering problems when attempting to retrieve pages from websites 
> over which I have no control and about which I can only learn from the 
> response headers (which may not be providing the correct information).
> 
> For example, I'm attempting to retrieve pages regardless of their encoding. 
> xdmp:http-get throws an error if the remote resource is not UTF-8 encoded:
> 
> let $http-get-options := <options xmlns="xdmp:http">
>   <verify-cert>false</verify-cert>
> </options>
> 
> return 
> xdmp:http-get("http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
> $http-get-options)
> 
> => XDMP-DOCUTF8SEQ: 
> xdmp:http-get("http://www.elmoudjahid.com/fr/actualites/64076/?comopen",
> <options xmlns="xdmp:http"><verify-cert>false</verify-cert><repair 
> xmlns="xdmp:document-get...</options>) -- Invalid UTF-8 escape sequence at 
> http://www.elmoudjahid.com/fr/actualites/64076/?comopen line 299 -- document 
> is not UTF-8 encoded
> 
> If I add the option <encoding xmlns="xdmp:document-get">ISO-8859-1</encoding> 
> the content is retrieved as expected. I could inspect the content-type
> header which *may* contain encoding information (or not, or it may not 
> actually be true). I could brute-force via nested try/catch constructs a 
> number of probable encodings, but I'd hope there is more intelligent ... ?
> 
> 
> Another use case is the retrieval of pages via HTTPS where the certificate is 
> no longer valid. As you can see from the example above, I'm setting the 
> <verify-cert> option to false by default, because I really don't care whether 
> the webmaster looks after their certificates (should I?).
> 
> 
> Then there are the 30x responses which may/do contain Location headers or 
> more redirects for the final resource location.
> 
> 
> As you can see/are probably aware, there are lots of parameters that may
> cause xdmp:http-get to not return the expected result.  Does someone have a 
> wrapper function up their sleeves that they don't mind sharing?  Or point me 
> to an algorithm or an implementation (or at least to a complete list of above 
> issues)? I've written my own recursive function to resolve 30x responses, but 
> it's just a little thing among many others.
> 
> cheers,
> Jakob.
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
