Re: [racket-users] Html to text, how to obtain a rough preview

2017-06-01 Thread Neil Van Dyke
Aside on a security issue with some example code...  I know in this case 
it was being done as prototype/experiment, which is perfectly fine, but 
since people often learn from code they see on the email list, we should 
probably mention...



(system/exit-code (string-append
                   "gnome-web-photo "
                   "--mode=thumbnail "
                   (url->string url)
                   " --file "
                   (some-system-path->string f)
                   " > /dev/null 2>&1"))



In production-quality code, you will almost never run an external 
process like this.  With "system/exit-code", there is an extra layer of 
POSIX shell command-line interpretation happening on the string here, 
which is really nasty.


Assembling a string like this for the shell to then parse tends (through 
accident or intent) to produce garbled command lines and misbehavior.  
If unexpected characters wind up in one of the non-static string 
values concatenated into the command line, you can end up running 
arbitrary code.  https://xkcd.com/327/


In production code, you'd probably use some variant of Racket 
`subprocess` or `system*`, which don't involve the shell like variants 
of `system` do.  Or have rigorous escaping or checking code, to make 
sure that no problematic string value is added to the command line.
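As a minimal sketch of the shell-free approach: `system*/exit-code` passes each argument directly to the program, with no shell in between, so metacharacters in an untrusted value are inert.  (This uses `echo` as a stand-in for a real tool like gnome-web-photo, and a deliberately hostile URL string.)

```racket
#lang racket
(require racket/system)

;; Imagine this value came from untrusted input.  Handed to a shell as
;; part of a command string, the "; rm -rf ~" part would run as a
;; second command.
(define url-string "http://example.com/ ; rm -rf ~")

;; With system*, each argument goes straight to the program's argv,
;; so the hostile value is just one ordinary (harmless) argument.
;; find-executable-path may return #f on systems without `echo`.
(define code
  (system*/exit-code (find-executable-path "echo")
                     "--mode=thumbnail" url-string))
```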


Also, in production, you might want to save stderr and possibly stdout 
from the process, in case the process fails.  That diagnostic 
information can then be added to whatever exception or logging 
your Racket program does for the failure.
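A rough sketch of capturing both streams with `subprocess`, again using `echo` as a stand-in for the real external tool:

```racket
#lang racket
(require racket/port)

;; Launch the program with fresh pipes for stdout and stderr (the #f
;; arguments ask subprocess to create them), so we can keep whatever
;; the process printed as diagnostic information.
(define-values (proc out in err)
  (subprocess #f #f #f (find-executable-path "echo") "hello"))
(close-output-port in)                 ; we send no input
(define stdout-text (port->string out))
(define stderr-text (port->string err))
(close-input-port out)
(close-input-port err)
(subprocess-wait proc)
(define code (subprocess-status proc))
;; On failure, surface the captured stderr in the error or your log:
(unless (zero? code)
  (error 'run-tool "exited ~a; stderr: ~a" code stderr-text))
```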


--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [racket-users] Html to text, how to obtain a rough preview

2017-06-01 Thread Erich Rast
Thank you all so much for your help!

Philip McGrath's method seems to work best overall, although many
sites produce a parsing error.

However, I found a potentially better solution on Linux that I might
use if I can get it to work cross-platform. There is an ingenious tool
called gnome-web-photo that lets you make a thumbnail from a URL!
It's too slow for a live preview, but images can be cached. Of course,
this has nothing to do with the original question about a text-only
preview.

Here is the code, in case someone might need it:

#lang racket
(require net/url
 racket/system
 (prefix-in pict: pict))

(provide get-url-preview-image)

(define (get-url-preview-image url callback)
  (case (system-type 'os)
    ((unix) (get-url-preview-image/unix url callback))
    (else (callback #f))))

(define (get-url-preview-image/unix url callback)
  (void
   (thread
    (lambda ()
      (let* ((f (make-temporary-file "projects_temp~a.png"))
             (result (system/exit-code
                      (string-append "gnome-web-photo "
                                     "--mode=thumbnail "
                                     (url->string url)
                                     " --file "
                                     (some-system-path->string f)
                                     " > /dev/null 2>&1"))))
        (cond
          ((= result 0)
           (callback (pict:bitmap f))
           (delete-file f))
          (else
           (delete-file f)
           (callback #f))))))))

(get-url-preview-image (string->url "http://www.nytimes.com")
                       (lambda (img)
                         (when img
                           (pict:show-pict img))))


Best,

Erich

On Wed, 31 May 2017 01:34:26 -0400
Neil Van Dyke  wrote:

> Erich Rast wrote on 05/30/2017 04:37 PM:
> > I've found out that it's far less trivial than expected, but not
> > because of sxml or the tree walking itself. Often call/input-url
> > just returns '() or '(*TOP*),  
> 
> To troubleshoot this... First, I'd take a quick look at the
> application code.  Then I'd check whether the HTTP request is
> returning the HTML, and copy that HTML.  Then I'd feed that copied
> HTML to `html-parsing`, to see whether there's something weird about
> the HTML that has discovered a bug in the parser.  Then I'd look more
> closely at the surrounding application code, to see whether it's
> introducing the problem, and to see whether it's doing important
> error-checking.  Then I'd see whether it's some transient cause
> (especially in the successful downloading of HTML), or a heisenbug.
> Something like that.



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Neil Van Dyke



Erich Rast wrote on 05/30/2017 04:37 PM:

I've found out that it's far less trivial than expected, but not
because of sxml or the tree walking itself. Often call/input-url
just returns '() or '(*TOP*),


To troubleshoot this... First, I'd take a quick look at the application 
code.  Then I'd check whether the HTTP request is returning the HTML, 
and copy that HTML.  Then I'd feed that copied HTML to `html-parsing`, 
to see whether there's something weird about the HTML that has 
discovered a bug in the parser.  Then I'd look more closely at the 
surrounding application code, to see whether it's introducing the 
problem, and to see whether it's doing important error-checking.  Then 
I'd see whether it's some transient cause (especially in the successful 
downloading of HTML), or a heisenbug.  Something like that.
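The "feed the copied HTML to `html-parsing`" step can be as simple as the following sketch, which bypasses the HTTP layer entirely (the snippet string is a made-up example):

```racket
#lang racket
(require html-parsing)

;; Parse a saved copy of the HTML directly, with no network involved,
;; to tell download problems apart from parsing problems.
(define snippet "<html><body><p>Hello <b>world</b></p></body></html>")
(html->xexp snippet)
;; A healthy parse yields a full tree under *TOP*.  If this works but
;; your fetched page yields only '(*TOP*), suspect the fetch (HTTP
;; status, redirects, TLS), not the parser.
```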



  and sometimes it also fails with an
exception on https addresses.


If this is happening *consistently* on particular HTTPS URL domains and 
ports, without knowing more about how the failure exhibits, I'd start to 
think SSL/TLS version problem, or a certificate authentication problem.


If this is happening *intermittently* for a particular HTTPS URL domain 
and port, then I'd have to know more about the failure exhibits, and 
from what examples of requests, to start to troubleshoot.



  Then some websites also seem to be fully
dynamic with javascript and just return a lot of gobbledygook.


Sadly, JavaScript uses in practice mean that any HTML scraper 
generalized to *all* Web sites now has to basically perform the 
(anti-engineering, cracksmoking) atrocity that is the modern Web page 
load, including running JS for the DOM, and perhaps some sense of 
layout/rendering semantics.  I don't currently have tools set up for 
this, but it's doable.


If you're doing scraping of only a small number of specific Web 
sites/pages, that's a different problem, and the best path might be to 
handle any quirks of each one individually.  Especially if you don't 
want to lose some information that is interpretable to humans, or that 
requires some interaction, but is lost if you just do a generic JS page 
load and scrape.


Also note that each site's layout and interaction structure is a moving 
target.  Every now and then, they'll change their layouts/UI, their 
development frameworks or CMSs, their CDNs.  I've been doing Web 
scraping since the mid-1990s (starting with my own Java parser, before I 
wrote the current Scheme/Racket one), and I currently do things like 
maintain some metadata about which CDNs which sites are using, and what 
the anti-privacy and anti-security hooks are... and something is always 
changing, on some site.


I'm happy to offer any tips that I can, as well as fix any bug found in 
`html-parsing`.  Getting into the details of a particular site or code, 
OTOH, can take a lot of work, and is part of how I make a living as a 
consultant. :)  As always, other Racketeers and I will help on the email 
list, to the extent we can.




Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Philip McGrath
I would handle this by adding some special cases to ignore the content of
script tags, extract alt text for images when it's provided, etc.

This gets meaningful content from both nytimes.com and cnn.com (though CNN
seems to only have their navigation links accessible without JavaScript):

#lang racket

(require net/url
 html-parsing
 sxml
 xml
 xml/path
 )

(define (fetch url)
  (call/input-url url
  get-pure-port
  port->string))

(define (html->string-content src-str)
  (let loop ([to-go (se-path*/list
                     '(body)
                     (xml->xexpr
                      (document-element
                       (read-xml
                        (open-input-string
                         (srl:sxml->xml-noindent
                          (html->xexp src-str)))))))]
             [so-far ""])
(match to-go
  ['() so-far]
  [(cons (? string? str)
 more)
   (loop more (string-append so-far str))]
  [(cons (cons (or 'style 'script) _)
 more)
   (loop more so-far)]
  [(cons (list-rest 'img
(list-no-order (list 'alt alt-text)
   _ ...)
_)
 more)
   (loop more (string-append so-far alt-text))]
  [(cons (list-rest _ _ body)
 more)
   (loop more (loop body so-far))]
  [(cons _ more) ;ignore entities, CDATA, p-i for simplicity
   (loop more so-far)])))

(displayln
 (html->string-content
  (fetch (string->url "https://www.nytimes.com"))))


I only converted to the xml library's x-expressions because I haven't
worked with SXML before.

Probably the first thing I'd improve is the handling of whitespace, since
it's normalized in HTML, but then you'd probably want more special cases
for tags like <br> that should translate to whitespace in your output.


-Philip

On Tue, May 30, 2017 at 4:17 PM, Jon Zeppieri  wrote:

> ((sxpath '(// *text*)) doc)
>
> should return all (and only) the text nodes in doc. I'm not so
> familiar with the sxml-xexp compatibility stuff, so I don't know if
> you can use an xexp here or if you really need an sxml document.
>
> On Tue, May 30, 2017 at 7:08 AM, Erich Rast  wrote:
> > Hi all,
> >
> > I need a function to provide a rough textual preview (without
> > formatting except newlines) of the content of a web page.
> >
> > So far I'm using this:
> >
> > (require net/url
> >  html-parsing
> >  sxml)
> >
> > (provide fetch fetch-string-content)
> >
> > (define (fetch url)
> >   (call/input-url url
> >   get-pure-port
> >   port->string))
> >
> > (define (fetch-string-content url)
> >   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
> >
> > The sxpath correctly returns the body sexp, but fetch-string-content
> > still only returns an empty string or a bunch of "\n\n\n".
> >
> > I guess the problem is that sxml:text only returns what is immediately
> > below the element, and that's not what I want. There are all kinds of
> > unknown div and span tags in web pages. I'm looking for a way to get
> > a simplified version of the textual content of the html body. If I was
> > on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> > to be cross-platform.
> >
> > Is there a sxml trick to achieve that? It doesn't need to be perfect.
> >
> > Best,
> >
> > Erich
> >



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Jon Zeppieri
((sxpath '(// *text*)) doc)

should return all (and only) the text nodes in doc. I'm not so
familiar with the sxml-xexp compatibility stuff, so I don't know if
you can use an xexp here or if you really need an sxml document.
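For what it's worth, `html->xexp` output seems SXML-compatible enough for this in a quick sketch (the input string is a made-up example):

```racket
#lang racket
(require html-parsing
         sxml)

;; Parse a small HTML fragment, then collect every text node anywhere
;; in the document with the (// *text*) path.
(define doc (html->xexp "<body><p>Hello <b>world</b></p></body>"))

((sxpath '(// *text*)) doc)
;; e.g. '("Hello " "world")
```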

On Tue, May 30, 2017 at 7:08 AM, Erich Rast  wrote:
> Hi all,
>
> I need a function to provide a rough textual preview (without
> formatting except newlines) of the content of a web page.
>
> So far I'm using this:
>
> (require net/url
>  html-parsing
>  sxml)
>
> (provide fetch fetch-string-content)
>
> (define (fetch url)
>   (call/input-url url
>   get-pure-port
>   port->string))
>
> (define (fetch-string-content url)
> >   (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))
>
> The sxpath correctly returns the body sexp, but fetch-string-content
> still only returns an empty string or a bunch of "\n\n\n".
>
> I guess the problem is that sxml:text only returns what is immediately
> below the element, and that's not what I want. There are all kinds of
> unknown div and span tags in web pages. I'm looking for a way to get
> a simplified version of the textual content of the html body. If I was
> on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
> to be cross-platform.
>
> Is there a sxml trick to achieve that? It doesn't need to be perfect.
>
> Best,
>
> Erich
>



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Erich Rast
On Tue, 30 May 2017 11:30:00 -0400
Neil Van Dyke  wrote:

> Writing a procedure that does what you want should be pretty easy,
> and then you can fine-tune it to your particular application. Recall
> that SXML is mostly old-school Lisp lists.  The procedure is mostly
> just a straightforward recursive traversal of lists.
> 

I've found out that it's far less trivial than expected, but not
because of sxml or the tree walking itself. Often call/input-url
just returns '() or '(*TOP*), and sometimes it also fails with an
exception on https addresses. Then some websites also seem to be fully
dynamic with javascript and just return a lot of gobbledygook. 

Here's what I've come up with:



#lang racket

(require net/url
 racket/string
 html-parsing
 sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
  get-pure-port
  port->string))

(define (fetch-string-content url [limit 1200] [cutoff 40])
  (extract-text ((sxpath '(html body)) (html->xexp (fetch url)))
                limit cutoff))

(define (walk-list li limit cutoff)
  (filter (lambda (x)
            (and (string? x)
                 (> (string-length x) cutoff)))
          (for/fold ([result '()])
                    ([elem (in-list li)])
            (cond
              ((list? elem)
               (if (member (car elem) '(@ *COMMENT* class src script))
                   result
                   (append result (walk-list elem limit cutoff))))
              (else (append result (list elem)))))))

(define (extract-text li limit cutoff)
  (define (combine s1 s2)
    (string-append s1 "\n" (string-normalize-spaces s2)))
  (let loop ((l (walk-list li limit cutoff))
             (result ""))
    (cond
      ((null? l) (string-trim result))
      ((> (string-length (combine result (car l))) limit)
       (string-trim result))
      (else (loop (cdr l) (combine result (car l)))))))

(displayln (fetch-string-content
            (string->url "https://www.racket-lang.org")))



The racket site works fine. However, try https://www.nytimes.com and you
only get javascript code, and e.g. http://www.cnn.com returns nothing. :/

I suppose I'd need way more precise parsing and some real text
summarization algorithm to get better results.

Anyway, does anybody have ideas for improvements?

Best,

Erich



Re: [racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Neil Van Dyke

Erich Rast wrote on 05/30/2017 07:08 AM:
I need a function to provide a rough textual preview (without 
formatting except newlines) of the content of a web page.


Writing a procedure that does what you want should be pretty easy, and 
then you can fine-tune it to your particular application. Recall that 
SXML is mostly old-school Lisp lists.  The procedure is mostly just a 
straightforward recursive traversal of lists.  (The traversal has a few 
modes, such as for content, and for attribute lists.)
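A minimal sketch of such a traversal, under the assumption that content is a plain string or a `(tag ...)` list and that attribute lists are tagged with `@` (the helper name `sxml->text` is made up):

```racket
#lang racket

;; Walk an SXML-ish tree: keep strings, recurse into child lists,
;; skip attribute lists (which start with '@), ignore everything else.
(define (sxml->text node)
  (cond
    [(string? node) node]
    [(pair? node)
     (apply string-append
            (for/list ([kid (in-list (cdr node))]
                       #:unless (and (pair? kid) (eq? '@ (car kid))))
              (sxml->text kid)))]
    [else ""]))

(sxml->text '(body (p "Hello " (b "world")) (div (@ (class "x")) "!")))
;; → "Hello world!"
```

A real version would add the modes Neil mentions, e.g. inserting newlines for block-level tags, or pulling alt text out of attribute lists instead of skipping them.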





[racket-users] Html to text, how to obtain a rough preview

2017-05-30 Thread Erich Rast
Hi all,

I need a function to provide a rough textual preview (without
formatting except newlines) of the content of a web page.

So far I'm using this:

(require net/url
 html-parsing
 sxml)

(provide fetch fetch-string-content)

(define (fetch url)
  (call/input-url url
  get-pure-port
  port->string))

(define (fetch-string-content url)
  (sxml:text ((sxpath '(html body)) (html->xexp (fetch url)))))

The sxpath correctly returns the body sexp, but fetch-string-content
still only returns an empty string or a bunch of "\n\n\n".

I guess the problem is that sxml:text only returns what is immediately
below the element, and that's not what I want. There are all kinds of
unknown div and span tags in web pages. I'm looking for a way to get
a simplified version of the textual content of the html body. If I was
on Linux only I'd use "lynx -dump -nolist" in a subprocess, but it needs
to be cross-platform.

Is there a sxml trick to achieve that? It doesn't need to be perfect.

Best,

Erich
