Hi, 

I've just turned my cascadia from a thin wrapper around the Go Cascadia 
package <https://github.com/andybalholm/cascadia> into a poor man's 
scraper tool. Please check it out at
https://github.com/suntong/cascadia

Here are some excerpts:

The Go Cascadia package <https://github.com/andybalholm/cascadia> implements 
CSS selectors for HTML. This command-line tool started as a thin wrapper 
around that package, but is growing into a better tool to test CSS 
selectors without writing Go code:

Block selection mode

The single selection mode outputs the selection as HTML source. If you 
want the text content instead, you can use the block selection mode.


$ echo '<div class="container"><p align="justify"><b>Name: </b>John Doe</p></div>' | tee /tmp/cascadia.xml | cascadia -i -o -c 'div > p'
1 elements for 'div > p':<p align="justify"><b>Name: </b>John Doe</p>

$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='p'
SelText
Name: John Doe


However, the real power of *block selection mode* lies in its ability 
to produce tsv/csv tables without any Go programming:


$ curl --silent https://news.ycombinator.com | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No      Title   Site
1.      Onedrive is slow on Linux but fast with a “Windows” user-agent (2016)   microsoft.com
2.      Starting today, users of Firefox can also enjoy Netflix on Linux        netflix.com
3.      Research Debt   distill.pub
...
27.     USPS Informed Delivery – Digital Images of Front of Mailpieces  usps.com
28.     Performance bugs – the dark matter of programming bugs  forwardscattering.org
29.     Most items of clothing have complicated international journeys  bbc.co.uk
30.     High-performance employees need quieter work spaces     qz.com
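
If you need CSV rather than TSV, the table can be post-processed with 
standard tools. Below is a minimal sketch, assuming the output really is 
tab-separated as it appears above; tsv_to_csv is a hypothetical helper 
name of mine, not part of cascadia:

```shell
# Hypothetical helper (not part of cascadia): convert tab-separated
# text on stdin to CSV on stdout.
# Fields containing a comma or a double quote are wrapped in double
# quotes, and embedded double quotes are doubled.
tsv_to_csv() {
  awk -F '\t' '{
    line = ""
    for (i = 1; i <= NF; i++) {
      field = $i
      if (field ~ /[",]/) {
        gsub(/"/, "\"\"", field)
        field = "\"" field "\""
      }
      line = line (i > 1 ? "," : "") field
    }
    print line
  }'
}
```

To use it, append "| tsv_to_csv" to the cascadia pipeline. Titles that 
contain commas, like row 2 above, come out quoted.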


It's a poor man's scraper tool when text is the only thing needed. For 
scraping beyond text, go one step further and use 
andrew-d/goscrape <https://github.com/andrew-d/goscrape> (or my goscrape 
<https://github.com/suntong/goscrape>, which has some enhancements 
to it).


Again, if text is the only thing needed, then cascadia might already be 
enough. Here is how to scrape Hacker News *across several pages*:


$ curl --silent 'https://news.ycombinator.com/news?p=[1-3]' | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No      Title   Site
1.      Starting today, users of Firefox can also enjoy Netflix on Linux        netflix.com
2.      Onedrive is slow on Linux but fast with a “Windows” user-agent (2016)   microsoft.com
3.      Research Debt   distill.pub
...
27.     Yes I Still Want to Be Doing This at 56 (2012)  thecodist.com
28.     Performance bugs – the dark matter of programming bugs  forwardscattering.org
29.     USPS Informed Delivery – Digital Images of Front of Mailpieces  usps.com
30.     High-performance employees need quieter work spaces     qz.com
31.     Most items of clothing have complicated international journeys  bbc.co.uk
32.     Telstra’s Gigabit Class LTE Network     cellularinsights.com
...
58.     The New Laptop Ban Adds to Travelers' Lack of Privacy and Security      eff.org
59.     QEMU: user-to-root privesc inside VM via bad translation caching        chromium.org
60.     Startups that debuted at Y Combinator W17 Demo Day 2    techcrunch.com
61.     The Cracking Monolith: Forces That Call for Microservices       semaphoreci.com
62.     Amsterdam Airport Launches API Platform schiphol.nl
...
88.     Founder Stories: Leah Culver of Breaker (YC W17)        ycombinator.com
89.     Find out what you, or someone on your team, did on the last working day github.com
90.     PSD2 – a directive that will change banking in Europe   evry.com
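
The multi-page trick relies on curl expanding the [1-3] range in the URL 
into three requests, one per page, whose output is concatenated on stdout 
before cascadia sees it. As a rough sketch of what that expansion amounts 
to, the helper below (page_urls is a hypothetical name of mine, not part 
of curl or cascadia) prints the same URLs explicitly:

```shell
# Hypothetical helper: print one URL per page, mirroring what curl's
# [first-last] range in "news?p=[1-3]" expands to.
# Usage: page_urls BASE_URL FIRST LAST
page_urls() {
  base=$1
  p=$2
  last=$3
  while [ "$p" -le "$last" ]; do
    printf '%s?p=%s\n' "$base" "$p"
    p=$((p + 1))
  done
}

# Example (needs network access; cascadia flags are the ones used above):
#   page_urls https://news.ycombinator.com/news 1 3 | xargs curl --silent |
#     cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
```

The explicit form is handy when the page numbers do not form a simple 
range, since the list of URLs can then come from anywhere.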

