Re: [Chicken-users] Parsing HTML, best practice with Chicken
Hi,

On Mon, 29 Dec 2014 12:12:22 +0100 Kooda ko...@upyum.com wrote:

>> ;; --- member? returns #t if element x is in list lst.
>> ;; --- ref:
>> ;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
>> ;; --- use: (member? a (list a 1)) --> #t
>> (define (member? x lst)
>>   (fold (lambda (e r) (or r (equal? e x))) #f lst))
>
> This function already exists, it’s called `member` and is in the srfi-1 library.

It's actually in the Scheme specification:
http://www.schemers.org/Documents/Standards/R5RS/HTML/r5rs-Z-H-9.html#%_idx_432

`member' from SRFI-1 provides an extension to allow the equality procedure to be passed in as an extra argument.

Best wishes.
Mario
--
http://parenteses.org/mario

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
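As a rough illustration of the point above, the sketch below shows what the standard `member` returns and how the SRFI-1 variant accepts an equality procedure. It assumes CHICKEN 4 (hence `use`); the names `tail`, `missing`, `strict`, `numeric` and the `member?` wrapper are only illustrative.

```scheme
(use srfi-1) ; CHICKEN 4 syntax, as in the thread; CHICKEN 5 would write (import srfi-1)

;; member does not return #t: it returns the sublist that starts at the
;; first matching element, or #f when nothing matches.
(define tail (member 'b '(a b c)))
(define missing (member 'z '(a b c)))

;; The default comparison is equal?, which treats 2.0 and 2 as different.
(define strict (member 2.0 '(1 2 3)))

;; SRFI-1 extension: an optional third argument supplies the equality test.
(define numeric (member 2.0 '(1 2 3) =))

;; If a strict boolean is wanted, wrap the result (illustrative helper).
(define (member? x lst)
  (and (member x lst) #t))
```

Because any value other than `#f` counts as true, the sublist returned by `member` can usually be used directly in a conditional, so the wrapper is only needed when the result is stored or printed.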
Re: [Chicken-users] Parsing HTML, best practice with Chicken
On Mon, Dec 29, 2014 at 03:28:15AM +0100, mfv wrote:
> So far, I have been getting the site with http-client, the raw html to sxml with html-parser, and trying to process the resulting list with matchable/srfi-13.

I would recommend avoiding that, as it can get really messy. sxpath is meant for this sort of thing, but unfortunately it's really difficult to use IMO. I somehow always manage to get it working with sxpath when I need to do some web scraping, but it's somewhat painful.

> I am not sure how much good it will do to use regex on those lists.

You can't, in general. Neither would I recommend this, except perhaps when parsing the text content (and even then it might fail due to inline markup).

> Are there any packages like Python's Beautifulsoup in the Chicken arsenal?

That sort of thing is sorely lacking. There's a promising zipper library written by Moritz Heidkamp, but so far it's unreleased and undocumented. If you're feeling very adventurous you could have a look at it: https://bitbucket.org/DerGuteMoritz/zipper

There also used to be an sxml-match egg for CHICKEN 3, but nobody's bothered to port it to CHICKEN 4 so far. AFAIK its main advantage was that it was exactly like matchable, but document order-insensitive for attribute nodes.

> ; grab a website
> (define lnk "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
> (define raw (with-input-from-request lnk #f read-string))
> ;; convert site crawl data from html to sxml
> (define sxml (html->sxml raw))

This can be done directly, without creating an intermediate large string, by using html->sxml on a port:

(define sxml (call-with-input-request lnk #f html->sxml))

In fact, I didn't even know you could use html->sxml on a string. This seems to be an undocumented feature of html-parser :)

Cheers,
Peter
--
http://www.more-magic.net
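For readers who have not seen sxpath before, here is a minimal sketch of the kind of query it is meant for. It assumes CHICKEN 4 with the sxpath egg installed; the `doc` tree is hand-written sample data standing in for parser output, and `all-hrefs` is an illustrative name, not something from the thread.

```scheme
(use sxpath) ; CHICKEN 4 syntax; requires the sxpath egg

;; Hand-written SXML standing in for the parsed page (hypothetical data).
(define doc
  '(*TOP*
    (html
     (body
      (a (@ (href "/a")) "A")
      (p (a (@ (href "/b")) "B"))))))

;; Path steps: any descendant, <a> elements, their attribute list,
;; the href attribute, and finally its text value.
(define all-hrefs
  ((sxpath '(// a @ href *text*)) doc))
```

The query is compiled once by `sxpath` into a procedure, so the same path can be applied to many documents.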
Re: [Chicken-users] Parsing HTML, best practice with Chicken
Hey!

> Sxml-transform and other sxml related eggs can certainly help you here, but I don’t know them really well so I can’t help you with that.

Thanks, I will look into that.

>> ;; saving function
>> (define (savedata somedata filename)
>>   (call-with-output-file filename
>>     (lambda (p)
>>       (let f ((ls somedata))
>>         (unless (null? ls)
>>           (display (car ls) p) ; changed: display-write
>>           (newline p)
>>           (f (cdr ls)))))))
>
> Here you can simply use `write` instead of your big function. `pp` can also be useful if you want to read the resulting file with a text editor.

Oh, so you do not mean just to replace 'display' with 'write'? As I remember, I need to open a port anyway, or am I mistaken? That is why I wrote that big function, actually.

>> ;; --- member? returns #t if element x is in list lst.
>
> This function already exists, it’s called `member` and is in the srfi-1 library.

Bugger. You are right. I should check chickadee more often.

>> ;; --- string-contains/m returns #t if all strings of list lsstr are in
>> (define (string-contains/m str lsstr)
>>   (if (string? str)
>>       (if (not (member? #f (map (lambda (x) (string-contains-ci str x)) lsstr)))
>>           #t)))
>
> This looks wrong to me, your function can return an unspecified value, try with this:

And again: Bingo. I had lots of undefined values, and I really wondered where they came from.

I am still puzzled how the undefined value is generated. It cannot come from the (if (string? str) ... clause. Or does it? I understand that you used 'and' and removed one redundant check with if. But what form produced the #undefined output?

> I hope this will help you.

Yes, thank you very much. Your function worked perfectly fine!

Cheers,
Piotr
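On the question of where the unspecified value comes from: an `if` without an else branch returns an unspecified value whenever its test is false, and the quoted `string-contains/m` has two such forms. The sketch below shows this and one possible `and`-based rewrite. It is not the replacement posted in the thread (that code is not preserved here); `one-armed` and `string-contains/all` are illustrative names, and CHICKEN 4 is assumed.

```scheme
(use srfi-1 srfi-13) ; CHICKEN 4 syntax; every is from srfi-1, string-contains-ci from srfi-13

;; A one-armed if: when x is not a string there is no else branch to
;; evaluate, so the result is an unspecified value rather than #f.
(define (one-armed x)
  (if (string? x) #t))

;; Sketch of an and-based version that always returns #t or #f.
;; and stops at the first #f; every checks each needle in turn; the
;; trailing #t turns the final match index into a plain boolean.
(define (string-contains/all str needles)
  (and (string? str)
       (every (lambda (needle) (string-contains-ci str needle)) needles)
       #t))
```

Since only `#f` counts as false, an unspecified result is treated as true by `filter`, which would explain non-string elements surviving the filtering step.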
Re: [Chicken-users] Happy Christmas
Happy holidays from California, USA!

Cheers,
Kevin Wortman

On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net wrote:
> Happy holidays Felix, the rest of the Chicken team and everybody else on this list. Thank you for the good work!
>
> Karel
Re: [Chicken-users] Parsing HTML, best practice with Chicken
Hello Piotr,

The neuromorpho egg is a scraper-like utility to fetch information from a public database with neuronal reconstructions. You can look at the code for examples of page scraping with sxpath. In particular, take a look at the procedures table-alist, extract-metadata, extract-pages-from-search-results. Obviously these are specific to the particular page structure served by NeuroMorpho, but this might help.

-Ivan

On Sun, Dec 28, 2014 at 6:28 PM, mfv m...@freeshell.de wrote:
> Hello,
>
> I am currently playing around with Chicken and the web. More precisely, I want to make some web link collection and see how well it goes for me when scraping web sites for links and content.
>
> Which eggs would you recommend for that? What should I avoid doing?
>
> So far, I have been getting the site with http-client, the raw html to sxml with html-parser, and trying to process the resulting list with matchable/srfi-13.
>
> I am not sure how much good it will do to use regex on those lists. Are there any packages like Python's Beautifulsoup in the Chicken arsenal?
>
> So far, I have some troubles when trying to parse the resulting sxml, both with matchable and string-contains.
>
> Cheers,
> Piotr
>
> ps: ze code so far:
>
> ;; version 0.0.3
> ; high level HTTP client, HTML/SXML parsing library and regular expression
> ; library
> (use http-client html-parser matchable srfi-13)
>
> ; grab a website
> (define lnk "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
> (define raw (with-input-from-request lnk #f read-string))
>
> ;; convert site crawl data from html to sxml
> (define sxml (html->sxml raw))
>
> ;; saving function
> ;; * display form is more suitable, for it evaluates all those \n and other
> ;; * special characters
> ;; * might be good to remove these things from regex
> ;; * processing, too.
> (define (savedata somedata filename)
>   (call-with-output-file filename
>     (lambda (p)
>       (let f ((ls somedata))
>         (unless (null? ls)
>           (display (car ls) p) ; changed: display-write
>           (newline p)
>           (f (cdr ls)))))))
>
> ; check how much the output is parsable..
> (savedata sxml "output.txt")
>
> ;; non-TCO
> (define (flatten x)
>   (cond ((null? x) '())
>         ((not (pair? x)) (list x))
>         (else (append (flatten (car x))
>                       (flatten (cdr x))))))
>
> (define sxmlflat (flatten sxml))
>
> ;; ***
> ;; Multi-check procedure is needed to check whether STRING element has:
> ;;   journal-id: 10.1002
> ;;   link string: issuetoc
> ;;
> ;; function:
> ;; takes list of strings and checks whether the element has them.
> ;; AND operator.
> ;; ***
>
> ;; --- member? returns #t if element x is in list lst.
> ;; --- ref:
> ;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
> ;; --- use: (member? a (list a 1)) --> #t
> (define (member? x lst)
>   (fold (lambda (e r) (or r (equal? e x))) #f lst))
>
> ;; --- string-contains/m returns #t if all strings of list lsstr are in
> ;; --- string str.
> ;; --- case insensitive string matching.
> ;; --- does not check if lsstr is empty. This would return #t.
> ;; --- use: (string-contains/m Somestring '("10.1002" "issuetoc"))
> (define (string-contains/m str lsstr)
>   (if (string? str)
>       (if (not (member? #f (map (lambda (x) (string-contains-ci str x)) lsstr)))
>           #t)))
>
> (savedata (filter (lambda (x) (string-contains/m x '("10.1002" "http://" "toc")))
>                   sxmlflat)
>           "filtered3.txt")
>
> ;; Something is wrong with those bloody strings!
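Flattening the SXML tree throws away the information about which string belonged to which attribute. As an alternative to both flattening and sxpath, here is a small hand-rolled walker that keeps the structure and collects `href` values from `a` elements. It uses only core Scheme; `sample`, `attribute-value` and `collect-hrefs` are illustrative names and the sample data is made up, not taken from the site in the thread.

```scheme
;; Hand-written SXML sample standing in for the parser's output
;; (hypothetical data).
(define sample
  '(*TOP*
    (html
     (body
      (p "Current issue: "
         (a (@ (href "http://example.org/10.1002/issuetoc")) "TOC"))
      (a (@ (class "nav") (href "/about")) "About")))))

;; Return the value of attribute NAME on element NODE, or #f.
;; Assumes the attribute list, if present, is the element's second item.
(define (attribute-value node name)
  (let ((attrs (and (pair? (cdr node))
                    (pair? (cadr node))
                    (eq? '@ (car (cadr node)))
                    (cdr (cadr node)))))
    (cond ((and attrs (assq name attrs))
           => (lambda (entry) (and (pair? (cdr entry)) (cadr entry))))
          (else #f))))

;; Collect every href of an <a> element in document order by walking
;; the tree recursively instead of flattening it first.
(define (collect-hrefs node)
  (if (not (pair? node))
      '()
      (let ((below (apply append (map collect-hrefs (cdr node))))
            (href (and (eq? (car node) 'a)
                       (attribute-value node 'href))))
        (if href (cons href below) below))))
```

The resulting list contains only strings that were actual link targets, so a later substring check such as the journal-id filter no longer has to cope with symbols or unrelated text nodes.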
Re: [Chicken-users] Parsing HTML, best practice with Chicken
Hello,

> I somehow always manage to get it working with sxpath when I need to do some web scraping, but it's somewhat painful.

Thanks, I will have a look at sxpath.

>> Are there any packages like Python's Beautifulsoup in the Chicken arsenal?
>
> That sort of thing is sorely lacking. There's a promising zipper library written by Moritz Heidkamp, but so far it's unreleased and undocumented. If you're feeling very adventurous you could have a look at it: https://bitbucket.org/DerGuteMoritz/zipper

Pity. I will have a look at the BeautifulSoup source. Maybe I can copy/mimic some sort of its functionality. And yes, I will have a look at 'zipper'.

> (define sxml (call-with-input-request lnk #f html->sxml))

You are right. It is step by step for me, and I am in the first steps.. (-;

> In fact, I didn't even know you could use html->sxml on a string. This seems to be an undocumented feature of html-parser :)

I actually just tried it, as I had great difficulties in understanding the actual documentation of html-parser. No idea what it does under the hood - especially with all those :start:, :end:, :process: commands - and I did not have the time to glimpse into the source.

All in all, I must say that it is much more difficult to get going with Chicken than with Python. The overall language is simple, but the learning curve is fairly steep - and I am not sure whether it will pay off. In terms of tooling, Python/Threading/Beautifulsoup might be the winner here. It is a simple 'hack-away' experience. But I guess that does not make me learn new tricks...

My hope is that scheme is some sort of entry door into LFE/Clojure and makes me think more about algorithms.

Regards,
Piotr
Re: [Chicken-users] Parsing HTML, best practice with Chicken
On Mon, Dec 29, 2014 at 07:47:33PM +0100, mfv wrote:
> All in all, I must say that it is much more difficult to get going with Chicken than with Python. The overall language is simple, but the learning curve is fairly steep - and I am not sure whether it will pay off.

Hm, that's unfortunate. However, I've heard this complaint before. Do you have any tips on how we can improve the situation?

Cheers,
Peter
--
http://www.more-magic.net
Re: [Chicken-users] Happy Christmas
Is it too late to join in the Happy holidays sentiment? I hope you guys had (and/or are having) a great holiday season.

Cheers,
Pedro.

On Mon, Dec 29, 2014 at 1:56 PM, Kevin Wortman kwort...@gmail.com wrote:
> Happy holidays from California, USA!
>
> Cheers,
> Kevin Wortman
>
> On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net wrote:
>> Happy holidays Felix, the rest of the Chicken team and everybody else on this list. Thank you for the good work!
>>
>> Karel

--
T: +1 (416) - 357.5356
Skype ID: pmelendezu
Re: [Chicken-users] Happy Christmas
I'm a little late too! I also want to wish everyone a wonderful vacation. And a happy new year with many blessings!

K.

On Dec 29, 2014 9:15 PM, Pedro Melendez pmelen...@pevicom.com wrote:
> Is it too late to join to the Happy holidays sentiment? I hope you guys had (and/or are having) a great holiday season.
>
> Cheers,
> Pedro.
>
> On Mon, Dec 29, 2014 at 1:56 PM, Kevin Wortman kwort...@gmail.com wrote:
>> Happy holidays from California, USA!
>>
>> Cheers,
>> Kevin Wortman
>>
>> On Sat, Dec 27, 2014 at 1:06 AM, Karel Miklav ka...@lovetemple.net wrote:
>>> Happy holidays Felix, the rest of the Chicken team and everybody else on this list. Thank you for the good work!
>>>
>>> Karel
>
> --
> T: +1 (416) - 357.5356
> Skype ID: pmelendezu
Re: [Chicken-users] Parsing HTML, best practice with Chicken
On Tue, Dec 30, 2014 at 3:47 AM, mfv m...@freeshell.de wrote:
> Hello,
>
>> I somehow always manage to get it working with sxpath when I need to do some web scraping, but it's somewhat painful.
>
> Thanks, I will have a look at sxpath.
>
>>> Are there any packages like Python's Beautifulsoup in the Chicken arsenal?
>>
>> That sort of thing is sorely lacking. There's a promising zipper library written by Moritz Heidkamp, but so far it's unreleased and undocumented. If you're feeling very adventurous you could have a look at it: https://bitbucket.org/DerGuteMoritz/zipper
>
> Pity. I will have a look at the BeautifulSoup source. Maybe I can copy/mimic some sort of its functionality.

html-parser is intended to be the parsing side of BeautifulSoup. The idea is to do one thing well, and leave it up to other libraries to do matching and extraction. As Peter says, matchable can be cumbersome here because it doesn't do unordered matching.

If you find any bugs or surprising behavior in html-parser please let me know.

--
Alex