Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Neil Van Dyke



Erich Rast wrote on 05/30/2017 04:37 PM:

I've found out that it's far less trivial than expected, but not
because of sxml or the tree walking itself. Often call/input-url
just returns '() or '(*TOP*),


To troubleshoot this... First, I'd take a quick look at the application 
code.  Then I'd check whether the HTTP request is returning the HTML, 
and copy that HTML.  Then I'd feed that copied HTML to `html-parsing`, 
to see whether there's something weird about the HTML that exposes a bug in the parser.  Then I'd look more closely at the 
surrounding application code, to see whether it's introducing the 
problem, and to see whether it's doing important error-checking.  Then 
I'd see whether it's some transient cause (especially in the successful 
downloading of HTML), or a heisenbug.  Something like that.
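
For instance, here is a minimal sketch of those first two steps (the URL is only a placeholder; the call/input-url arguments follow the code later in this thread), which makes it easy to see whether an empty result comes from the download or from the parse:

#lang racket

(require net/url
         html-parsing)

;; Step 1: fetch the raw HTML as a string, so it can be inspected directly.
;; (The URL here is only a placeholder.)
(define raw-html
  (call/input-url (string->url "https://www.racket-lang.org")
                  get-pure-port
                  port->string))

;; If this prints 0, or an error page's worth of text, the problem is in the
;; HTTP request, not in the parser.
(printf "fetched ~a characters\n" (string-length raw-html))

;; Step 2: feed the very same string to html-parsing.  If raw-html looks fine
;; but this is just '(*TOP*), the HTML has exposed a parser problem.
(define parsed (html->xexp raw-html))
(printf "parsed: ~v\n" parsed)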



  and sometimes it also fails with an
exception on https addresses.


If this is happening *consistently* for particular HTTPS URL domains and 
ports, then without knowing more about how the failure manifests, I'd start 
by suspecting an SSL/TLS version problem or a certificate authentication problem.


If this is happening *intermittently* for a particular HTTPS URL domain 
and port, then I'd need to know more about how the failure manifests, and 
for which example requests, before starting to troubleshoot.
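
A small sketch of how one might capture that failure (the URL is a placeholder; the handler just prints whatever net/url reports, which usually names the TLS or certificate problem):

#lang racket

(require net/url)

;; Placeholder; substitute one of the failing HTTPS addresses.
(define failing-url (string->url "https://example.com/"))

(with-handlers ([exn:fail? (lambda (e)
                             (printf "request failed: ~a\n" (exn-message e)))])
  (printf "fetched ~a characters\n"
          (string-length
           (call/input-url failing-url get-pure-port port->string))))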



  Then some websites also seem to be fully
dynamic with javascript and just return a lot of gobbledygook.


Sadly, the way JavaScript is used in practice means that any HTML scraper 
generalized to *all* Web sites now basically has to perform the 
(anti-engineering, cracksmoking) atrocity that is the modern Web page 
load, including running JS for the DOM, and perhaps some sense of 
layout/rendering semantics.  I don't currently have tools set up for 
this, but it's doable.


If you're doing scraping of only a small number of specific Web 
sites/pages, that's a different problem, and the best path might be to 
handle any quirks of each one individually.  Especially if you don't 
want to lose some information that is interpretable to humans, or that 
requires some interaction, but is lost if you just do a generic JS page 
load and scrape.


Also note that each site's layout and interaction structure is a moving 
target.  Every now and then, they'll change their layouts/UI, their 
development frameworks or CMSs, their CDNs.  I've been doing Web 
scraping since the mid-1990s (starting with my own Java parser, before I 
wrote the current Scheme/Racket one), and I currently do things like 
maintain some metadata about which CDNs which sites are using, and what 
the anti-privacy and anti-security hooks are... and something is always 
changing, on some site.


I'm happy to offer any tips that I can, as well as fix any bug found in 
`html-parsing`.  Getting into the details of a particular site or code, 
OTOH, can take a lot of work, and is part of how I make a living as a 
consultant. :)  As always, other Racketeers and I will help on the email 
list, to the extent we can.




[racket-users] Re: Html to text, how to obtain a rough preview

2017-05-30 Thread George Neuner
On Tue, 30 May 2017 12:08:08 +0100, Erich Rast wrote:

>If I was on Linux only I'd use "lynx -dump -nolist" in a subprocess,
>but it needs to be cross-platform.

As a last resort, if your definition of "cross-platform" is just
Unix/Linux and Windows, then Lynx runs on all of those.
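
A minimal sketch of shelling out to it from Racket (assuming a lynx executable is on the PATH; the helper name and error handling are illustrative, not from the thread):

#lang racket

;; Run "lynx -dump -nolist <url>" and capture its output as a string.
(define (lynx-dump url-string)
  (define lynx (find-executable-path "lynx"))
  (unless lynx
    (error 'lynx-dump "could not find a lynx executable on the PATH"))
  (with-output-to-string
    (lambda ()
      (system* lynx "-dump" "-nolist" url-string))))

(displayln (lynx-dump "https://www.racket-lang.org"))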

George



[racket-users] Racket Webserver add routes outside of dispatch-rules

2017-05-30 Thread Zelphir Kaltstahl
I have the following server specification:

(define (start request)
  ;; for now only calling the dispatch
  ;; we could put some action here, which shall happen before each dispatching
  (hsk-dispatch request))

(define-values (hsk-dispatch a-url)
  (dispatch-rules [("") #:method "get" overview-app]
  [("index") #:method "get" overview-app]
  [("home") #:method "get" overview-app]
  [("hsk") #:method "get" hsk-app]
  [("ajax" "hsk-1-data") #:method "get" hsk-1-data]))

(serve/servlet
  start
  #:servlet-path "/index"  ; default URL
  ;; directory for static files
  #:extra-files-paths (list (build-path (current-directory) "static"))
  #:port 8000  ; the port on which the servlet is running
  #:servlet-regexp #rx""
  ;; should racket show the servlet running in a browser upon startup?
  #:launch-browser? false
  ;; #:quit? false  ; ???
  ;; the server will listen on ALL available IP addresses, not only on one specified
  #:listen-ip false
  #:server-root-path (current-directory)
  #:file-not-found-responder respond-unknown-file)

(plus some requires of other modules, etc.). I am adding routes in the 
dispatch-rules part of the program. However, I'd like to be able to add routes, 
and specify which procedures handle requests to them, elsewhere in the program.

The reason I want to do this is that I want to create a procedure similar 
to one I recently saw in Chicken Scheme's web framework Awful:

http://wiki.call-cc.org/eggref/4/awful#using-ajax

There is a procedure which generates jQuery code for a web page; that code 
then requests a route which I can specify when I call the ajax procedure.

I wonder (1) whether and (2) how I could specify routes like that outside of the 
dispatch-rules part of the program, so that I could get (3) something like Awful's 
ajax procedure, or (4) whether such a thing already exists for Racket's 
web server.
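
One possible direction, purely as a sketch (none of this is an existing web-server feature: dynamic-routes, register-ajax-route!, and dynamic-dispatch are made-up names), would be to keep a mutable table of extra handlers and fall through to it from dispatch-rules via an else clause:

#lang racket

(require web-server/dispatch
         web-server/http
         net/url)

;; Hypothetical registry mapping a path string like "ajax/some-data"
;; to a handler procedure.
(define dynamic-routes (make-hash))

;; Hypothetical helper, loosely modeled on Awful's ajax procedure:
;; register a server-side handler and return a jQuery snippet that calls it.
(define (register-ajax-route! path handler)
  (hash-set! dynamic-routes path handler)
  (format "$.get(~s, function (data) { /* update the DOM here */ });"
          (string-append "/" path)))

;; Fallback consulted when none of the static dispatch-rules match.
(define (dynamic-dispatch req)
  (define path
    (string-join (map path/param-path (url-path (request-uri req))) "/"))
  (define handler (hash-ref dynamic-routes path #f))
  (if handler
      (handler req)
      (response/xexpr '(html (body (p "Not found."))))))

(define-values (hsk-dispatch a-url)
  (dispatch-rules
   [("") #:method "get" (lambda (req) (response/xexpr '(html (body (p "home")))))]
   [else dynamic-dispatch]))

;; Elsewhere in the program, a route can now be registered outside dispatch-rules:
(define jquery-snippet
  (register-ajax-route! "ajax/some-data"
                        (lambda (req)
                          (response/xexpr '(html (body (p "some data")))))))

The jQuery string returned by register-ajax-route! could then be embedded into whatever page the corresponding handler renders.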

What I like about it is that I can code everything "on the server side" and in 
Racket, instead of having to switch to JavaScript at some point, and still 
"connect" parts of the actual DOM to procedures on the server side. Without 
such a thing, it might be better to let the server send only raw data and handle 
all DOM-tree logic in a static JavaScript file, because then I'd have knowledge 
of the DOM elements in the code, since I'd be creating them in JavaScript from 
that raw data. On the other hand, if I render some HTML on the server side, send 
it to the client, and the client needs to modify it and inform the server about 
it, I have to rely on certain ids and classes of DOM elements simply being 
there, while also switching context to JavaScript. That feels less clean than 
either generating all DOM elements in JavaScript or generating everything with 
a procedure like Awful's ajax procedure.

I don't need much, probably only a few click listeners, so my procedure could 
be less complex than Awful's ajax. Or, since Racket is very similar in syntax, 
I could try to port most of the code.

For another, UTF-8-related reason, I cannot use Chicken Scheme at the moment, so 
I reimplemented everything in Racket, where I do not have that problem, but 
where I also do not know of any such ajax procedure.



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Philip McGrath
I would handle this by adding some special cases to ignore the content of
script tags, extract alt text for images when it's provided, etc.

This gets meaningful content from both nytimes.com and cnn.com (though CNN
seems to only have their navigation links accessible without JavaScript):

#lang racket

(require net/url
         html-parsing
         sxml
         xml
         xml/path)

(define (fetch url)
  (call/input-url url
                  get-pure-port
                  port->string))

(define (html->string-content src-str)
  (let loop ([to-go (se-path*/list
                     '(body)
                     (xml->xexpr
                      (document-element
                       (read-xml
                        (open-input-string
                         (srl:sxml->xml-noindent
                          (html->xexp src-str)))))))]
             [so-far ""])
    (match to-go
      ['() so-far]
      [(cons (? string? str)
             more)
       (loop more (string-append so-far str))]
      [(cons (cons (or 'style 'script) _)
             more)
       (loop more so-far)]
      [(cons (list-rest 'img
                        (list-no-order (list 'alt alt-text)
                                       _ ...)
                        _)
             more)
       (loop more (string-append so-far alt-text))]
      [(cons (list-rest _ _ body)
             more)
       (loop more (loop body so-far))]
      [(cons _ more) ; ignore entities, CDATA, p-i for simplicity
       (loop more so-far)])))

(displayln
 (html->string-content
  (fetch (string->url "https://www.nytimes.com"))))


I only converted to the xml library's x-expressions because I haven't
worked with SXML before.

Probably the first thing I'd improve is the handling of whitespace, since
it's normalized in HTML, but then you'd probably want more special cases
for tags that should translate to whitespace in your output.
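
A minimal sketch of that first improvement (collapse-whitespace is a made-up helper; it just post-processes the accumulated string rather than changing the traversal):

(require racket/string)

;; Collapse runs of whitespace to single spaces, roughly as HTML rendering would.
(define (collapse-whitespace s)
  (string-normalize-spaces s))

;; Reusing fetch and html->string-content from the code above:
(displayln
 (collapse-whitespace
  (html->string-content
   (fetch (string->url "https://www.nytimes.com")))))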


-Philip

On Tue, May 30, 2017 at 4:17 PM, Jon Zeppieri  wrote:

> ((sxpath '(// *text*)) doc)
>
> should return all (and only) the text nodes in doc. I'm not so
> familiar with the sxml-xexp compatibility stuff, so I don't know if
> you can use an xexp here or if you really need an sxml document.
>
> On Tue, May 30, 2017 at 7:08 AM, Erich Rast  wrote:
> > Hi all,
> >
> > I need a function to provide a rough textual preview (without
> > formatting except newlines) of the content of a web page.
> >
> > So far I'm using this:
> >
> > (require net/url
> >  html-parsing
> >  sxml)
> >
> > (provide fetch fetch-string-content)
> >
> > (define (fetch url)
> >   (call/input-url url
> >   get-pure-port
> >   port->string))
> >
> > (define (fetch-string-content url)
> >   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
> >
> > The sxpath correctly returns the body sexp, but fetch-string-content
> > still only returns an empty string or a bunch of "\n\n\n".
> >
> > I guess the problem is that sxml:text only returns what is immediately
> > below the element, and that's not what I want. There are all kinds of
> > unknown div and span tags in web pages. I'm looking for a way to get
> > a simplified version of the textual content of the html body. If I was
> > on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> > to be cross-platform.
> >
> > Is there a sxml trick to achieve that? It doesn't need to be perfect.
> >
> > Best,
> >
> > Erich



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Jon Zeppieri
((sxpath '(// *text*)) doc)

should return all (and only) the text nodes in doc. I'm not so
familiar with the sxml-xexp compatibility stuff, so I don't know if
you can use an xexp here or if you really need an sxml document.
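
For example, a small self-contained sketch (the sample HTML is made up, and whether sxpath accepts html-parsing's xexp directly is exactly the open question above):

#lang racket

(require html-parsing
         sxml)

(define doc
  (html->xexp "<html><body><p>Hello <b>world</b></p><p>again</p></body></html>"))

;; Collect every text node in the document.
(define text-nodes ((sxpath '(// *text*)) doc))

(displayln (string-join (filter string? text-nodes) " "))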

On Tue, May 30, 2017 at 7:08 AM, Erich Rast  wrote:
> Hi all,
>
> I need a function to provide a rough textual preview (without
> formatting except newlines) of the content of a web page.
>
> So far I'm using this:
>
> (require net/url
>  html-parsing
>  sxml)
>
> (provide fetch fetch-string-content)
>
> (define (fetch url)
>   (call/input-url url
>   get-pure-port
>   port->string))
>
> (define (fetch-string-content url)
>   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
>
> The sxpath correctly returns the body sexp, but fetch-string-content
> still only returns an empty string or a bunch of "\n\n\n".
>
> I guess the problem is that sxml:text only returns what is immediately
> below the element, and that's not what I want. There are all kinds of
> unknown div and span tags in web pages. I'm looking for a way to get
> a simplified version of the textual content of the html body. If I was
> on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> to be cross-platform.
>
> Is there a sxml trick to achieve that? It doesn't need to be perfect.
>
> Best,
>
> Erich



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Erich Rast
On Tue, 30 May 2017 11:30:00 -0400
Neil Van Dyke  wrote:

> Writing a procedure that does what you want should be pretty easy,
> and then you can fine-tune it to your particular application. Recall
> that SXML is mostly old-school Lisp lists.  The procedure is mostly
> just a straightforward recursive traversal of lists.
> 

I've found out that it's far less trivial than expected, but not
because of sxml or the tree walking itself. Often call/input-url
just returns '() or '(*TOP*), and sometimes it also fails with an
exception on https addresses. Then some websites also seem to be fully
dynamic with javascript and just return a lot of gobbledygook. 

Here's what I've come up with:



#lang racket

(require net/url
 racket/string
 html-parsing
 sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
  get-pure-port
  port->string))

(define (fetch-string-content url [limit 1200] [cutoff 40])
  (extract-text ((sxpath '(html body))(html->xexp (fetch url)))
limit cutoff))

(define (walk-list li limit cutoff)
  (filter (lambda (x)
            (and (string? x)
                 (> (string-length x) cutoff)))
          (for/fold ([result '()])
                    ([elem (in-list li)])
            (cond
              ((list? elem)
               (if (member (car elem) '(@ *COMMENT* class src script))
                   result
                   (append result (walk-list elem limit cutoff))))
              (else (append result (list elem)))))))

(define (extract-text li limit cutoff)
  (define (combine s1 s2)
    (string-append s1 "\n" (string-normalize-spaces s2)))
  (let loop ((l (walk-list li limit cutoff))
             (result ""))
    (cond
      ((null? l) (string-trim result))
      ((> (string-length (combine result (car l))) limit)
       (string-trim result))
      (else (loop (cdr l) (combine result (car l)))))))

(displayln (fetch-string-content (string->url "https://www.racket-lang.org")))



The Racket site works fine. However, try https://www.nytimes.com and you
only get JavaScript code, and e.g. http://www.cnn.com returns nothing. :/

I suppose I'd need way more precise parsing and some real text
summarization algorithm to get better results.

Anyway, does anybody have ideas for improvements?

Best,

Erich



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Neil Van Dyke

Erich Rast wrote on 05/30/2017 07:08 AM:
I need a function to provide a rough textual preview (without 
formatting except newlines) of the content of a web page.


Writing a procedure that does what you want should be pretty easy, and 
then you can fine-tune it to your particular application. Recall that 
SXML is mostly old-school Lisp lists.  The procedure is mostly just a 
straightforward recursive traversal of lists.  (The traversal has a few 
modes, such as for content, and for attribute lists.)
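
A minimal sketch of such a traversal (not Neil's code; it simply walks the nested lists produced by html->xexp, skips the attribute-list and comment nodes, and concatenates the strings it finds):

#lang racket

(require html-parsing)

;; Walk an xexp/SXML tree, collecting text content and skipping the
;; attribute-list and comment "modes" mentioned above.
(define (xexp->text node)
  (cond
    [(string? node) node]
    [(pair? node)
     (if (memq (car node) '(@ *COMMENT* *DECL* *PI* script style))
         ""
         (apply string-append
                (map xexp->text (cdr node))))]
    [else ""]))

(displayln
 (xexp->text
  (html->xexp "<html><body><p class=\"x\">Hello <b>world</b></p></body></html>")))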





[racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Erich Rast
Hi all,

I need a function to provide a rough textual preview (without
formatting except newlines) of the content of a web page.

So far I'm using this:

(require net/url
 html-parsing
 sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
  get-pure-port
  port->string))

(define (fetch-string-content url)
  (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))

The sxpath correctly returns the body sexp, but fetch-string-content
still only returns an empty string or a bunch of "\n\n\n".

I guess the problem is that sxml:text only returns what is immediately
below the element, and that's not what I want. There are all kinds of
unknown div and span tags in web pages. I'm looking for a way to get
a simplified version of the textual content of the html body. If I was
on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
to be cross-platform.

Is there a sxml trick to achieve that? It doesn't need to be perfect.

Best,

Erich
