Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread Mario Domenech Goulart
Hi,

On Mon, 29 Dec 2014 12:12:22 +0100 Kooda ko...@upyum.com wrote:

 ;; --- member? returns #t if element x is in list lst.
 ;; --- ref:
 ;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
 ;; --- use: (member? a (list a 1)) --> #t
 (define (member? x lst)
   (fold (lambda (e r)
           (or r (equal? e x)))
         #f lst))

 This function already exists, it’s called `member` and is in the
 srfi-1 library.

It's actually in the Scheme specification:
http://www.schemers.org/Documents/Standards/R5RS/HTML/r5rs-Z-H-9.html#%_idx_432

`member' from SRFI-1 provides an extension to allow the equality
procedure to be passed in as an extra argument.
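
To illustrate the difference, here is a minimal sketch assuming CHICKEN 4, where the library is loaded with `(use srfi-1)` (CHICKEN 5 would use `(import srfi-1)` instead). The sample lists are made up for illustration. Note that `member` returns the matching tail of the list rather than #t.

```scheme
(use srfi-1)  ; CHICKEN 4 syntax; assumed here

;; R5RS behaviour: elements are compared with equal?
(define plain (member "b" '("a" "b" "c")))   ; tail starting at "b"

;; SRFI-1 extension: pass the equality procedure as a third argument
(define ci (member "B" '("a" "b" "c") string-ci=?))  ; ignores case

;; No match yields #f, so the result works directly as a condition
(define status (if (member "z" '("a" "b" "c")) 'found 'missing))
```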

Best wishes.
Mario
-- 
http://parenteses.org/mario

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users


Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread Peter Bex
On Mon, Dec 29, 2014 at 03:28:15AM +0100, mfv wrote:
 So far, I have been getting the site with http-client, the raw html to sxml
 with html-parser, and trying to process the resulting list with
 matchable/srfi-13.

I would recommend avoiding that, as it can get really messy.  sxpath is meant
for this sort of thing, but unfortunately it's really difficult to use IMO.

I somehow always manage to get it working with sxpath when I need to do
some web scraping, but it's somewhat painful.
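
As a rough sketch of what the sxpath approach can look like, the following pulls every link target out of a small hand-written SXML tree. It assumes CHICKEN 4 with the sxpath egg installed; the sample document and the name `all-hrefs` are invented for illustration, and real output from html-parser may be nested differently.

```scheme
(use sxpath)  ; assumes the sxpath egg is installed (CHICKEN 4 syntax)

;; Hand-written stand-in for parser output; not taken from a real page.
(define doc
  '(*TOP*
    (html
     (body
      (p "See " (a (@ (href "http://example.org/toc")) "the contents"))
      (a (@ (class "ext") (href "http://example.org/issue")) "issue")))))

;; // = any depth, a = <a> elements, @ = attribute list,
;; href = that attribute, *text* = its string value
(define all-hrefs (sxpath '(// a @ href *text*)))

(define links (all-hrefs doc))
(write links)
(newline)
```

Because the path names the attribute rather than its position, the second link is found even though `class` precedes `href`.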

 I am not sure how much good it will do to use regex on those
 lists.

You can't, in general.  Neither would I recommend this, except perhaps
when parsing the text content (and even then it might fail due to inline
markup).

  Are there any packages like Python's Beautifulsoup in the Chicken
 arsenal?

That sort of thing is sorely lacking.  There's a promising zipper
library written by Moritz Heidkamp, but so far it's unreleased and
undocumented.  If you're feeling very adventurous you could have
a look at it: https://bitbucket.org/DerGuteMoritz/zipper

There also used to be an sxml-match egg for CHICKEN 3, but nobody's
bothered to port it to CHICKEN 4 so far.  AFAIK its main advantage was
that it was exactly like matchable, but document order-insensitive for
attribute nodes.

 ; grab a website
 (define lnk
   "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
 (define raw (with-input-from-request lnk #f read-string))
 
 ;; convert site crawl data from html to sxml
 (define sxml (html->sxml raw))

This can be done directly, without creating a large intermediate
string, by using html->sxml on a port:

(define sxml (call-with-input-request lnk #f html->sxml))

In fact, I didn't even know you could use html->sxml on a
string.  This seems to be an undocumented feature of html-parser :)
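
The port-based call can be tried without a network connection. This is a small sketch assuming CHICKEN 4 and the html-parser egg; the markup and the names `sample-html` and `sample-sxml` are made up, and a string port stands in for the port that http-client would supply.

```scheme
(use html-parser)  ; CHICKEN 4 syntax; assumes the html-parser egg

;; Invented sample markup, used instead of a downloaded page.
(define sample-html "<p>Hello <a href=\"/toc\">contents</a></p>")

;; html->sxml reads from the port it is given, so no intermediate
;; copy of the whole document has to be built by the caller.
(define sample-sxml
  (html->sxml (open-input-string sample-html)))

(write sample-sxml)
(newline)
```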

Cheers,
Peter
-- 
http://www.more-magic.net



Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread mfv
Hey!

 Sxml-transform and other sxml related eggs can certainly help you here,
 but I don’t know them really well so I can’t help you with that.

thanks, I will look into that. 

 
 
  ;; saving function
  (define (savedata somedata filename)
    (call-with-output-file filename
      (lambda (p)
        (let f ((ls somedata))
          (unless (null? ls)
            (display (car ls) p)   ; changed: display-write
            (newline p)
            (f (cdr ls)))))))
 

 Here you can simply use `write` instead of your big function. `pp` can
 also be useful if you want to read the resulting file with a text
 editor.

Oh, so you do not mean just to replace 'display' with 'write'? As I
remember, I need to open a port anyway, or am I mistaken? That is why I
wrote that big function, actually. 
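
One way to read the suggestion, sketched here under the assumption of CHICKEN 4: a port is still needed, but `write` and `pp` print a whole nested list in one call, so the explicit loop goes away. The procedure names `save-datum` and `save-datum/pretty` and the sample data are hypothetical.

```scheme
(use extras)  ; pp lives in the extras unit in CHICKEN 4 (assumed here)

;; Write the whole datum in one call; write keeps strings quoted,
;; so the file can be read back with read.
(define (save-datum datum filename)
  (call-with-output-file filename
    (lambda (port)
      (write datum port)
      (newline port))))

;; Same idea, but pretty-printed for viewing in a text editor.
(define (save-datum/pretty datum filename)
  (call-with-output-file filename
    (lambda (port)
      (pp datum port))))

(save-datum '(html (body (p "hello"))) "output.txt")
(save-datum/pretty '(html (body (p "hello"))) "output-pretty.txt")
```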

  ;; --- member? returns #t if element x is in list lst.
 This function already exists, it’s called `member` and is in the srfi-1
 library.

Bugger. You are right. I should check chickadee more often. 
 
 
  ;; --- string-contains/m returns #t if all strings of list lsstr are in
  (define (string-contains/m str lsstr)
    (if (string? str)
        (if (not (member? #f (map (lambda (x) (string-contains-ci str x))
                                  lsstr))) #t)))
 
 This looks wrong to me, your function can return an unspecified value,
 try with this:

And again: Bingo. I had lots of undefined values, and I really wondered
where they came from. 

I am still puzzled how undefined is generated. It can not come from the 

  (if (string? str) ...

clause. Or does it? I understand that you used 'and' and removed one
redundant check with if. But what form produced the #undefined output?
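
To make the source of the unspecified value concrete: an `if` without an else branch returns an unspecified value whenever its test is false, and the posted function has two such one-armed `if` forms, so either can produce it. The sketch below is one possible rewrite, not necessarily the version suggested earlier in the thread; it assumes CHICKEN 4 with srfi-1 and srfi-13, and the test strings are invented.

```scheme
(use srfi-1 srfi-13)  ; every, string-contains-ci (CHICKEN 4 syntax)

;; A one-armed if has no value to return when the test fails:
(if #f #t)   ; result is unspecified, not #f

;; Possible rewrite: and returns #f as soon as any test fails,
;; so every path yields a proper boolean.
(define (string-contains/m str lsstr)
  (and (string? str)
       (every (lambda (x) (string-contains-ci str x)) lsstr)
       #t))  ; normalize the index returned by string-contains-ci to #t

(define r1 (string-contains/m "http://example.org/10.1002/issuetoc"
                              '("10.1002" "issuetoc")))
(define r2 (string-contains/m "no match here" '("10.1002")))
(define r3 (string-contains/m 'not-a-string '("10.1002")))
```

Here r1 is #t, while r2 and r3 are both #f, so `filter` never sees an unspecified value.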


 I hope this will help you.

Yes, thank you very much. Your function worked perfectly fine!

Cheers, 

  Piotr



Re: [Chicken-users] Happy Christmas

2014-12-29 Thread Kevin Wortman
Happy holidays from California, USA!

Cheers,
Kevin Wortman

On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net wrote:

 Happy holidays Felix, the rest of the Chicken team and everybody else on
 this list.

 Thank you for the good work!

 Karel



Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread Ivan Raikov
Hello Piotr,

   The neuromorpho egg is a scraper-like utility to fetch information from
a public database with neuronal reconstructions. You can look at the code
for examples of page scraping with sxpath. In particular, take a look at
the procedures table-alist, extract-metadata, and
extract-pages-from-search-results. Obviously these are specific to the
particular page structure served by NeuroMorpho, but this might help.

   -Ivan




On Sun, Dec 28, 2014 at 6:28 PM, mfv m...@freeshell.de wrote:

 Hello,

 I am currently playing around with Chicken and the web. More precisely, I
 want to make a web link collection and see how well it goes for me when
 scraping web sites for links and content.

 Which eggs would you recommend for that? What should I avoid doing?

 So far, I have been getting the site with http-client, the raw html to sxml
 with html-parser, and trying to process the resulting list with
 matchable/srfi-13. I am not sure how much good it will do to use regex on
 those lists. Are there any packages like Python's Beautifulsoup in the
 Chicken arsenal?

 So far, I have some trouble when trying to parse the resulting sxml, both
 with matchable and string-contains.

 Cheers,

   Piotr


 ps: ze code so far:



 ;; version 0.0.3

 ; high level HTTP client, HTML/SXML parsing library, pattern matching
 ; library, and list/string libraries (srfi-1 provides fold and filter)
 (use http-client html-parser matchable srfi-1 srfi-13)

 ; grab a website
 (define lnk
   "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
 (define raw (with-input-from-request lnk #f read-string))

 ;; convert site crawl data from html to sxml
 (define sxml (html->sxml raw))

 ;; saving function
 ;; * display form is more suitable, for it evaluates all those \n and other
 ;;   special characters
 ;; * might be good to remove these things from regex processing, too.
 (define (savedata somedata filename)
   (call-with-output-file filename
     (lambda (p)
       (let f ((ls somedata))
         (unless (null? ls)
           (display (car ls) p)   ; changed: display-write
           (newline p)
           (f (cdr ls)))))))

 ; check how much the output is parsable..
 (savedata sxml "output.txt")

 ;; non-TCO
 (define (flatten x)
   (cond ((null? x) '())
         ((not (pair? x)) (list x))
         (else (append (flatten (car x))
                       (flatten (cdr x))))))

 (define sxmlflat (flatten sxml))

 ;; ***
 ;; Multi-check procedure is needed to check whether STRING element has:
 ;;  journal-id: 10.1002
 ;;  link string: issuetoc
 ;;
 ;; function:
 ;;   takes list of strings and checks whether the element has them.
 ;;   AND operator.
 ;; ***


 ;; --- member? returns #t if element x is in list lst.
 ;; --- ref:
 ;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
 ;; --- use: (member? a (list a 1)) --> #t
 (define (member? x lst)
   (fold (lambda (e r)
           (or r (equal? e x)))
         #f lst))

 ;; --- string-contains/m returns #t if all strings of list lsstr are in
 ;; --- string str.
 ;; --- case insensitive string matching.
 ;; --- does not check if lsstr is empty. This would return #t.
 ;; --- use: (string-contains/m Somestring '("10.1002" "issuetoc"))
 (define (string-contains/m str lsstr)
   (if (string? str)
       (if (not (member? #f (map (lambda (x) (string-contains-ci str x))
                                 lsstr))) #t)))


 (savedata
  (filter (lambda (x) (string-contains/m x '("10.1002" "http://" "toc")))
          sxmlflat)
  "filtered3.txt")

 ;; Something is wrong with those bloody strings!



Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread mfv
Hello, 

 I somehow always manage to get it working with sxpath when I need to do
 some web scraping, but it's somewhat painful.

Thanks, I will have a look at sxpath.


   Are there any packages like Python's Beautifulsoup in the Chicken
  arsenal?
 
 That sort of thing is sorely lacking.  There's a promising zipper
 library written by Moritz Heidkamp, but so far it's unreleased and
 undocumented.  If you're feeling very adventurous you could have
 a look at it: https://bitbucket.org/DerGuteMoritz/zipper

Pity. I will have a look at the BeautifulSoup source. Maybe I can
copy/mimic some of its functionality.

And yes, I will have a look at 'zipper'.
 
 (define sxml (call-with-input-request lnk #f html->sxml))

You are right. It is step by step for me, and I am in the first steps.. (-;

 In fact, I didn't even know you could use html->sxml on a
 string.  This seems to be an undocumented feature of html-parser :)

I actually just tried it, as I had great difficulties in understanding the
actual documentation of html-parser. No idea what it does under the hood -
especially with all those :start:, :end:, :process: commands - and I did
not have the time to look into the source. 

All in all, I must say that it is much more difficult to get going with
Chicken than with Python. The overall language is simple, but the learning
curve is fairly steep - and I am not sure whether it will pay off. 

In terms of tooling, Python/Threading/Beautifulsoup might be the winner
here. It is a simple 'hack-away' experience. But I guess that does not make
me learn new tricks...

My hope is that scheme is some sort of entry door into LFE/Clojure and makes
me think more about algorithms. 

Regards, 

  Piotr



Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread Peter Bex
On Mon, Dec 29, 2014 at 07:47:33PM +0100, mfv wrote:
 All in all, I must say that it is much more difficult to get going with
 Chicken than with Python. The overall language is simple, but the learning
 curve is fairly steep - and I am not sure whether it will pay off. 

Hm, that's unfortunate.  However, I've heard this complaint before.
Do you have any tips on how we can improve the situation?

Cheers,
Peter
-- 
http://www.more-magic.net



Re: [Chicken-users] Happy Christmas

2014-12-29 Thread Pedro Melendez
Is it too late to join in the Happy holidays sentiment?

I hope you guys had (and/or are having) a great holiday season.

Cheers,

Pedro.

On Mon, Dec 29, 2014 at 1:56 PM, Kevin Wortman kwort...@gmail.com wrote:

 Happy holidays from California, USA!

 Cheers,
 Kevin Wortman

 On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net
 wrote:

 Happy holidays Felix, the rest of the Chicken team and everybody else on
 this list.

 Thank you for the good work!

 Karel





-- 
T: +1 (416) - 357.5356
Skype ID: pmelendezu


Re: [Chicken-users] Happy Christmas

2014-12-29 Thread Kristian Lein-Mathisen
I'm a little late too!

I also want to wish everyone a wonderful vacation. And a happy new year
with many blessings!

K.
On Dec 29, 2014 9:15 PM, Pedro Melendez pmelen...@pevicom.com wrote:

 Is it too late to join in the Happy holidays sentiment?

 I hope you guys had (and/or are having) a great holiday season.

 Cheers,

 Pedro.

 On Mon, Dec 29, 2014 at 1:56 PM, Kevin Wortman kwort...@gmail.com wrote:

 Happy holidays from California, USA!

 Cheers,
 Kevin Wortman

 On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net
 wrote:

 Happy holidays Felix, the rest of the Chicken team and everybody else on
 this list.

 Thank you for the good work!

 Karel





 --
 T: +1 (416) - 357.5356
 Skype ID: pmelendezu





Re: [Chicken-users] Parsing HTML, best practice with Chicken

2014-12-29 Thread Alex Shinn
On Tue, Dec 30, 2014 at 3:47 AM, mfv m...@freeshell.de wrote:

 Hello,

  I somehow always manage to get it working with sxpath when I need to do
  some web scraping, but it's somewhat painful.

 Thanks, I will have a look at sxpath.


Are there any packages like Python's Beautifulsoup in the Chicken
   arsenal?
 
  That sort of thing is sorely lacking.  There's a promising zipper
  library written by Moritz Heidkamp, but so far it's unreleased and
  undocumented.  If you're feeling very adventurous you could have
  a look at it: https://bitbucket.org/DerGuteMoritz/zipper

 Pity. I will have a look at the BeautifulSoup source. Maybe I can
 copy/mimic some of its functionality.


html-parser is intended to be the parsing side of BeautifulSoup.
The idea is to do one thing well, and leave it up to other libraries
to do matching and extraction.  As Peter says, matchable can be
cumbersome here because it doesn't do unordered matching.
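
A small sketch of the ordering problem, assuming CHICKEN 4 and the matchable egg. The nodes and the names `href/match` and `href/assq` are invented for illustration; the second procedure shows one possible workaround, which is to look the attribute up by name instead of by position.

```scheme
(use matchable)  ; CHICKEN 4 syntax; assumes the matchable egg

;; Pattern that expects href to be the FIRST attribute of an <a> node.
(define (href/match node)
  (match node
    (('a ('@ ('href url) . _) . _) url)
    (_ #f)))

;; Order-insensitive alternative: capture the attribute list and
;; search it by name with assq.
(define (href/assq node)
  (match node
    (('a ('@ . attrs) . _)
     (cond ((assq 'href attrs) => cadr)
           (else #f)))
    (_ #f)))

;; The same link with its two attributes in different order.
(define link-1 '(a (@ (href "/toc") (class "nav")) "contents"))
(define link-2 '(a (@ (class "nav") (href "/toc")) "contents"))

(href/match link-1)  ; "/toc"
(href/match link-2)  ; #f, because class comes first
(href/assq link-2)   ; "/toc"
```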

If you find any bugs or surprising behavior in html-parser please
let me know.

-- 
Alex