On Tue, 30 May 2017 11:30:00 -0400
Neil Van Dyke <n...@neilvandyke.org> wrote:

> Writing a procedure that does what you want should be pretty easy,
> and then you can fine-tune it to your particular application. Recall
> that SXML is mostly old-school Lisp lists.  The procedure is mostly
> just a straightforward recursive traversal of lists.
> 

I've found out that it's far less trivial than expected, but not
because of SXML or the tree walking itself. Often call/input-url
just returns '() or '(*TOP*), and sometimes it also fails with an
exception on HTTPS addresses. Some websites also seem to be fully
dynamic with JavaScript and just return a lot of gobbledygook.

Here's what I've come up with:

----

#lang racket

(require net/url
         racket/string
         html-parsing
         sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
                  get-pure-port
                  port->string))

(define (fetch-string-content url [limit 1200] [cutoff 40])
  (extract-text ((sxpath '(html body))(html->xexp (fetch url)))
                limit cutoff))

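;; Collect the strings in an SXML tree that are longer than cutoff,
;; skipping attribute lists, comments, and script elements.
;; (limit is passed through but not used here.)
(define (walk-list li limit cutoff)
  (filter (lambda (x)
            (and (string? x)
                 (> (string-length x) cutoff)))
          (for/fold ([result '()])
                    ([elem (in-list li)])
            (cond
              ((null? elem) result) ; guard: (car '()) would raise
              ((list? elem)
               (if (member (car elem) '(@ *COMMENT* class src script))
                   result
                   (append result (walk-list elem limit cutoff))))
              (else (append result (list elem)))))))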

(define (extract-text li limit cutoff)
  (define (combine s1 s2)
    (string-append s1 "\n" (string-normalize-spaces s2)))
  (let loop ((l (walk-list li limit cutoff))
             (result ""))
    (cond
      ((null? l) (string-trim result))
      ((> (string-length (combine result (car l))) limit)
       (string-trim result))
      (else (loop (cdr l) (combine result (car l)))))))

(displayln (fetch-string-content
            (string->url "https://www.racket-lang.org")))

----

The Racket site works fine. However, try https://www.nytimes.com and you
only get JavaScript code, and e.g. http://www.cnn.com returns nothing. :/
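
One thing that might help with the empty results, as an untested sketch:
as far as I know, get-pure-port does not follow redirects unless it is
given #:redirections, so a site that redirects (for example from http to
https) may come back with an empty body. The name fetch/safe and the
default of 5 redirects are made up for illustration, and returning #f on
any exn:fail is just one possible policy.

```racket
#lang racket

(require net/url)

;; Hypothetical variant of fetch: follows up to max-redirects redirects
;; and returns #f instead of raising when the request fails.
(define (fetch/safe url [max-redirects 5])
  (with-handlers ([exn:fail? (lambda (e) #f)])
    (call/input-url url
                    (lambda (u)
                      (get-pure-port u #:redirections max-redirects))
                    port->string)))
```

A caller would then check for #f before handing the result to html->xexp.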

I suppose I'd need way more precise parsing and some real text
summarization algorithm to get better results.

Anyway, does anybody have ideas for improvements?

Best,

Erich
