Re: [racket-users] Html to text, how to obtain a rough preview
Erich Rast wrote on 05/30/2017 04:37 PM:
> I've found out that it's far less trivial than expected, but not because of sxml or the tree walking itself. Often call/input-url just returns '() or '(*TOP*),

To troubleshoot this... First, I'd take a quick look at the application code. Then I'd check whether the HTTP request is returning the HTML, and copy that HTML. Then I'd feed that copied HTML to `html-parsing`, to see whether there's something weird about the HTML that has exposed a bug in the parser. Then I'd look more closely at the surrounding application code, to see whether it's introducing the problem, and to see whether it's doing important error-checking. Then I'd see whether it's some transient cause (especially in the successful downloading of HTML), or a heisenbug. Something like that.

> and sometimes it also fails with an exception on https addresses.

If this is happening *consistently* on particular HTTPS URL domains and ports, then, without knowing more about how the failure manifests, I'd start to suspect an SSL/TLS version problem, or a certificate authentication problem.

If this is happening *intermittently* for a particular HTTPS URL domain and port, then I'd have to know more about how the failure manifests, and for which example requests, to start to troubleshoot.

> Then some websites also seem to be fully dynamic with javascript and just return a lot of gobbledygook.

Sadly, the way JavaScript is used in practice means that any HTML scraper generalized to *all* Web sites now has to basically perform the (anti-engineering, cracksmoking) atrocity that is the modern Web page load, including running JS for the DOM, and perhaps some sense of layout/rendering semantics. I don't currently have tools set up for this, but it's doable.

If you're scraping only a small number of specific Web sites/pages, that's a different problem, and the best path might be to handle the quirks of each one individually. That's especially true if you don't want to lose information that is interpretable to humans, or that requires some interaction, but that is lost if you just do a generic JS page load and scrape.

Also note that each site's layout and interaction structure is a moving target. Every now and then, they'll change their layouts/UI, their development frameworks or CMSs, their CDNs. I've been doing Web scraping since the mid-1990s (starting with my own Java parser, before I wrote the current Scheme/Racket one), and I currently do things like maintain some metadata about which CDNs which sites are using, and what the anti-privacy and anti-security hooks are... and something is always changing, on some site.

I'm happy to offer any tips that I can, as well as fix any bug found in `html-parsing`. Getting into the details of a particular site or code, OTOH, can take a lot of work, and is part of how I make a living as a consultant. :) As always, other Racketeers and I will help on the email list, to the extent we can.

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
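The second troubleshooting step above (check whether the HTTP request is actually returning HTML) could be sketched as follows. This is only a sketch under assumptions: the names `fetch/report` and `connect/redirects` and the redirect limit of 5 are illustrative, not from the thread. One possible cause of an empty result is that `get-pure-port` does not follow redirects unless `#:redirections` is given, so a redirecting site may yield an empty body.

```racket
#lang racket/base
(require net/url racket/port)

;; Assumed default connection procedure: follow up to 5 redirects
;; (the limit of 5 is an arbitrary choice for this sketch).
(define (connect/redirects u)
  (get-pure-port u #:redirections 5))

;; Returns the response body as a string, or #f if fetching raised an
;; exception. Problems are reported on the error port rather than hidden.
;; The optional `connect` argument exists so the procedure can be tried
;; without network access.
(define (fetch/report u [connect connect/redirects])
  (with-handlers ([exn:fail?
                   (lambda (e)
                     (eprintf "fetch of ~a failed: ~a\n"
                              (url->string u) (exn-message e))
                     #f)])
    (let ([body (call/input-url u connect port->string)])
      (when (string=? body "")
        (eprintf "fetch of ~a returned an empty body\n" (url->string u)))
      body)))
```

If `fetch/report` returns a non-empty string, that string can be saved and fed to `html->xexp` separately, which keeps download problems apart from parsing problems.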
[racket-users] Re: Html to text, how to obtain a rough preview
On Tue, 30 May 2017 12:08:08 +0100, Erich Rast wrote:

>If I was on Linux only I'd use "lynx -dump -nolist" in a subprocess,
>but it needs to be cross-platform.

As a last resort, if your definition of "cross-platform" is just Unix/Linux and Windows, then Lynx runs on all of those.

George
[racket-users] Racket Webserver add routes outside of dispatch-rules
I have the following server specification:

    (define (start request)
      ;; for now only calling the dispatch
      ;; we could put some action here, which shall happen before each dispatching
      (hsk-dispatch request))

    (define-values (hsk-dispatch a-url)
      (dispatch-rules
       [("") #:method "get" overview-app]
       [("index") #:method "get" overview-app]
       [("home") #:method "get" overview-app]
       [("hsk") #:method "get" hsk-app]
       [("ajax" "hsk-1-data") #:method "get" hsk-1-data]))

    (serve/servlet
     start
     #:servlet-path "/index"  ; default URL
     #:extra-files-paths (list (build-path (current-directory) "static"))  ; directory for static files
     #:port 8000  ; the port on which the servlet is running
     #:servlet-regexp #rx""
     #:launch-browser? false  ; should racket show the servlet running in a browser upon startup?
     ;; #:quit? false  ; ???
     #:listen-ip false  ; the server will listen on ALL available IP addresses, not only on one specified
     #:server-root-path (current-directory)
     #:file-not-found-responder respond-unknown-file)

with some requires of other modules etc.

I am adding routes in the dispatch-rules part of the program. However, I'd like to add routes, and specify which procedures handle requests to them, elsewhere in the program.

The reason is that I want to create a procedure similar to one I recently saw in Chicken Scheme's web framework Awful:

http://wiki.call-cc.org/eggref/4/awful#using-ajax

There is a procedure which generates jQuery code that goes on a web page and then requests a route, which I can specify when I call the ajax procedure.

I wonder (1) whether and (2) how I could specify routes like that outside of the dispatch-rules part of the program, so that (3) I could get something like Awful's ajax procedure, or (4) whether there already is such a thing for Racket's webserver.

What I like about it is that I can code everything "on the server side" and in Racket, instead of having to switch to JavaScript at some point, and I am still able to "connect" actual DOM elements to procedures on the server side.

Without such a thing, it might be better to let the server send only raw data and handle all DOM tree logic in a static JavaScript file. That way I'd have knowledge about the DOM elements in the code, because I'd be creating them in JavaScript on the basis of that raw data. On the other hand, if I render some HTML on the server side and send it to the client, and the client needs to modify it and inform the server about it, I'll have to rely on certain ids and classes of DOM elements simply being there, and I'll still have to switch context to JavaScript. This feels less clean than either generating all DOM elements in JavaScript code or generating everything with a procedure like Awful's ajax procedure.

I don't need much, probably only a few click listeners, so maybe my procedure could be less complex than Awful's ajax. Or, since Racket is very similar in syntax, I could try to copy most of the code.

For another, UTF-8-related reason, I cannot use Chicken Scheme at the moment, so I reimplemented everything in Racket, where I do not have that problem, but where I also do not know of any such ajax procedure.
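One way to approach questions (1) and (2) might be to keep a mutable route table next to the dispatch-rules dispatcher and consult it first. This is only a sketch under assumptions: `extra-routes`, `add-route!`, `find-route`, `request-path`, and `with-extra-routes` are hypothetical names invented here, not part of the Racket web server. Only `request-uri`, `request-method`, `url-path`, and `path/param-path` are existing library accessors.

```racket
#lang racket/base
(require racket/string
         net/url
         web-server/http)

;; Hypothetical route table: maps "method path" keys to handler procedures.
(define extra-routes (make-hash))

(define (route-key method path)
  (string-append (string-downcase method) " " path))

;; Register a handler from anywhere in the program.
(define (add-route! method path handler)
  (hash-set! extra-routes (route-key method path) handler))

;; Look up a handler; returns #f when nothing is registered.
(define (find-route method path)
  (hash-ref extra-routes (route-key method path) #f))

;; Rebuild a path such as "/ajax/hsk-1-data" from the request URL.
(define (request-path req)
  (string-append
   "/"
   (string-join
    (for/list ([pp (in-list (url-path (request-uri req)))]
               #:when (string? (path/param-path pp)))
      (path/param-path pp))
    "/")))

;; Wrap an existing dispatcher: try the table first, then fall back.
(define ((with-extra-routes fallback) req)
  (define handler
    (find-route (bytes->string/utf-8 (request-method req))
                (request-path req)))
  (if handler
      (handler req)
      (fallback req)))
```

With this in place, `start` could call `((with-extra-routes hsk-dispatch) request)`, and an ajax-style helper could call `add-route!` for its own route before returning the markup that requests it. Whether this matches what Awful does internally is not something the sketch claims.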
Re: [racket-users] Html to text, how to obtain a rough preview
I would handle this by adding some special cases to ignore the content of script tags, extract alt text for images when it's provided, etc.

This gets meaningful content from both nytimes.com and cnn.com (though CNN seems to only have their navigation links accessible without JavaScript):

    #lang racket

    (require net/url
             html-parsing
             sxml
             xml
             xml/path)

    (define (fetch url)
      (call/input-url url get-pure-port port->string))

    (define (html->string-content src-str)
      (let loop ([to-go (se-path*/list
                         '(body)
                         (xml->xexpr
                          (document-element
                           (read-xml
                            (open-input-string
                             (srl:sxml->xml-noindent
                              (html->xexp src-str)))))))]
                 [so-far ""])
        (match to-go
          ['() so-far]
          [(cons (? string? str) more)
           (loop more (string-append so-far str))]
          [(cons (cons (or 'style 'script) _) more)
           (loop more so-far)]
          [(cons (list-rest 'img (list-no-order (list 'alt alt-text) _ ...) _) more)
           (loop more (string-append so-far alt-text))]
          [(cons (list-rest _ _ body) more)
           (loop more (loop body so-far))]
          [(cons _ more) ; ignore entities, CDATA, p-i for simplicity
           (loop more so-far)])))

    (displayln
     (html->string-content
      (fetch (string->url "https://www.nytimes.com"))))

I only converted to the xml library's x-expressions because I haven't worked with SXML before.

Probably the first thing I'd improve is the handling of whitespace, since it's normalized in HTML, but then you'd probably want more special cases for tags like that should translate to whitespace in your output.

-Philip

On Tue, May 30, 2017 at 4:17 PM, Jon Zeppieri wrote:

> ((sxpath '(// *text*)) doc)
>
> should return all (and only) the text nodes in doc. I'm not so
> familiar with the sxml-xexp compatibility stuff, so I don't know if
> you can use an xexp here or if you really need an sxml document.
>
> On Tue, May 30, 2017 at 7:08 AM, Erich Rast wrote:
> > Hi all,
> >
> > I need a function to provide a rough textual preview (without
> > formatting except newlines) of the content of a web page.
> >
> > So far I'm using this:
> >
> > (require net/url
> >          html-parsing
> >          sxml)
> >
> > (provide fetch fetch-string-content)
> >
> > (define (fetch url)
> >   (call/input-url url
> >                   get-pure-port
> >                   port->string))
> >
> > (define (fetch-string-content url)
> >   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
> >
> > The sxpath correctly returns the body sexp, but fetch-string-content
> > still only returns an empty string or a bunch of "\n\n\n".
> >
> > I guess the problem is that sxml:text only returns what is immediately
> > below the element, and that's not what I want. There are all kinds of
> > unknown div and span tags in web pages. I'm looking for a way to get
> > a simplified version of the textual content of the html body. If I was
> > on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> > to be cross-platform.
> >
> > Is there a sxml trick to achieve that? It doesn't need to be perfect.
> >
> > Best,
> >
> > Erich
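The whitespace improvement Philip mentions could be sketched roughly as below, working on x-expressions as his code does. This is only a sketch under assumptions: the name `xexpr->preview` and the choice of tags in `block-tags` are illustrative and not from the thread. Whitespace inside text is collapsed to single spaces, and the chosen block-level tags are turned into line breaks.

```racket
#lang racket/base
(require racket/string)

;; Assumption: which tags should become line breaks is a judgment call.
(define block-tags '(p div br li tr h1 h2 h3 h4 h5 h6))

;; True when x is an element whose second item is an attribute list,
;; i.e. an empty list or a list of (name "value") pairs.
(define (has-attributes? x)
  (and (pair? (cdr x))
       (list? (cadr x))
       (or (null? (cadr x)) (pair? (car (cadr x))))))

(define (xexpr->preview x)
  (define out (open-output-string))
  (let walk ([x x])
    (cond
      ;; Text: collapse every whitespace run (including newlines) to a space.
      [(string? x) (write-string (regexp-replace* #px"\\s+" x " ") out)]
      ;; Never emit the content of script or style elements.
      [(and (pair? x) (memq (car x) '(script style))) (void)]
      [(pair? x)
       (define block? (memq (car x) block-tags))
       (define body (if (has-attributes? x) (cddr x) (cdr x)))
       (when block? (newline out))
       (for-each walk body)
       (when block? (newline out))]
      ;; Entities, numbers, CDATA and the like are ignored for simplicity.
      [else (void)]))
  ;; Tidy up: single spaces, single newlines, no leading/trailing blanks.
  (string-trim
   (regexp-replace* #px" *\n[ \n]*"
                    (regexp-replace* #px" +" (get-output-string out) " ")
                    "\n")))
```

For example, `(xexpr->preview '(body (p "Hello,\n   world") (div ((class "c")) "Second" (br) "line")))` yields three lines of text. Truncating the result to a preview length would be a separate step.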
Re: [racket-users] Html to text, how to obtain a rough preview
((sxpath '(// *text*)) doc)

should return all (and only) the text nodes in doc. I'm not so familiar with the sxml-xexp compatibility stuff, so I don't know if you can use an xexp here or if you really need an sxml document.

On Tue, May 30, 2017 at 7:08 AM, Erich Rast wrote:
> Hi all,
>
> I need a function to provide a rough textual preview (without
> formatting except newlines) of the content of a web page.
>
> So far I'm using this:
>
> (require net/url
>          html-parsing
>          sxml)
>
> (provide fetch fetch-string-content)
>
> (define (fetch url)
>   (call/input-url url
>                   get-pure-port
>                   port->string))
>
> (define (fetch-string-content url)
>   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
>
> The sxpath correctly returns the body sexp, but fetch-string-content
> still only returns an empty string or a bunch of "\n\n\n".
>
> I guess the problem is that sxml:text only returns what is immediately
> below the element, and that's not what I want. There are all kinds of
> unknown div and span tags in web pages. I'm looking for a way to get
> a simplified version of the textual content of the html body. If I was
> on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> to be cross-platform.
>
> Is there a sxml trick to achieve that? It doesn't need to be perfect.
>
> Best,
>
> Erich
Re: [racket-users] Html to text, how to obtain a rough preview
On Tue, 30 May 2017 11:30:00 -0400, Neil Van Dyke wrote:

> Writing a procedure that does what you want should be pretty easy,
> and then you can fine-tune it to your particular application. Recall
> that SXML is mostly old-school Lisp lists. The procedure is mostly
> just a straightforward recursive traversal of lists.

I've found out that it's far less trivial than expected, but not because of sxml or the tree walking itself. Often call/input-url just returns '() or '(*TOP*), and sometimes it also fails with an exception on https addresses. Then some websites also seem to be fully dynamic with javascript and just return a lot of gobbledygook.

Here's what I've come up with:

    #lang racket

    (require net/url
             racket/string
             html-parsing
             sxml)

    (provide fetch fetch-string-content)

    (define (fetch url)
      (call/input-url url
                      get-pure-port
                      port->string))

    (define (fetch-string-content url [limit 1200] [cutoff 40])
      (extract-text ((sxpath '(html body)) (html->xexp (fetch url)))
                    limit
                    cutoff))

    (define (walk-list li limit cutoff)
      (filter
       (lambda (x) (and (string? x) (> (string-length x) cutoff)))
       (for/fold ([result '()])
                 ([elem (in-list li)])
         (cond
           ((list? elem)
            (if (member (car elem) '(@ *COMMENT* class src script))
                result
                (append result (walk-list elem limit cutoff))))
           (else (append result (list elem)))))))

    (define (extract-text li limit cutoff)
      (define (combine s1 s2)
        (string-append s1 "\n" (string-normalize-spaces s2)))
      (let loop ((l (walk-list li limit cutoff))
                 (result ""))
        (cond
          ((null? l) (string-trim result))
          ((> (string-length (combine result (car l))) limit)
           (string-trim result))
          (else (loop (cdr l) (combine result (car l)))))))

    (displayln (fetch-string-content (string->url "https://www.racket-lang.org")))

The racket site works fine. However, try https://www.nytimes.com and you only get javascript code, and e.g. http://www.cnn.com returns nothing. :/

I suppose I'd need way more precise parsing and some real text summarization algorithm to get better results. Anyway, does anybody have ideas for improvements?

Best,

Erich
Re: [racket-users] Html to text, how to obtain a rough preview
Erich Rast wrote on 05/30/2017 07:08 AM:
> I need a function to provide a rough textual preview (without formatting except newlines) of the content of a web page.

Writing a procedure that does what you want should be pretty easy, and then you can fine-tune it to your particular application.

Recall that SXML is mostly old-school Lisp lists. The procedure is mostly just a straightforward recursive traversal of lists. (The traversal has a few modes, such as for content, and for attribute lists.)
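The recursive traversal described above might look roughly like this. It is a minimal sketch, assuming the document is in the SXML shape that `html->xexp` produces. The name `sxml->strings` and the list of skipped tags are illustrative assumptions, not part of `html-parsing` or `sxml`. The "modes" are reduced here to one rule: attribute lists (`@`) and other non-content nodes are skipped entirely.

```racket
#lang racket/base
(require racket/list)

;; Assumed set of node types whose contents are not visible text:
;; attribute lists, comments, processing instructions, declarations,
;; entities, and script/style elements.
(define skipped-tags '(@ *COMMENT* *PI* *DECL* *ENTITY* & script style))

;; Collect all text strings below `node`, in document order.
(define (sxml->strings node)
  (cond
    [(string? node) (list node)]
    ;; An element or special node: (tag child ...)
    [(and (pair? node) (symbol? (car node)))
     (if (memq (car node) skipped-tags)
         '()
         (append-map sxml->strings (cdr node)))]
    ;; A node set such as an sxpath result: (node ...)
    [(list? node) (append-map sxml->strings node)]
    [else '()]))
```

Unlike `sxml:text`, which only looks at the immediate children of an element, this descends through nested div and span elements. The pieces can be combined with `(apply string-append (sxml->strings doc))`.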
[racket-users] Html to text, how to obtain a rough preview
Hi all,

I need a function to provide a rough textual preview (without formatting except newlines) of the content of a web page.

So far I'm using this:

    (require net/url
             html-parsing
             sxml)

    (provide fetch fetch-string-content)

    (define (fetch url)
      (call/input-url url
                      get-pure-port
                      port->string))

    (define (fetch-string-content url)
      (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))

The sxpath correctly returns the body sexp, but fetch-string-content still only returns an empty string or a bunch of "\n\n\n".

I guess the problem is that sxml:text only returns what is immediately below the element, and that's not what I want. There are all kinds of unknown div and span tags in web pages. I'm looking for a way to get a simplified version of the textual content of the html body. If I was on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs to be cross-platform.

Is there a sxml trick to achieve that? It doesn't need to be perfect.

Best,

Erich