Re: [Factor-talk] Web scraping

petern Mon, 21 Nov 2016 01:01:28 -0800

Thank you, Björn. I found html.parser.analyzer to be the same way:
_good enough_.
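
For anyone finding this thread later, here is a self-contained sketch of
two of the queries quoted below, run against a small made-up html string
instead of a live page. The sample markup is my own invention, and I'm
assuming attribute, find-between-all and the text singleton are all
reachable through the vocabs listed in USING:, so adjust for your Factor
version:

```factor
USING: accessors html.parser html.parser.analyzer kernel
prettyprint sequences ;

! A made-up inline document stands in for a fetched page, so the
! snippet does not depend on the network or on any site's markup.
"<p><a href=\"/one\">first</a> and <a href=\"/two\">second</a></p>"
parse-html

! //@href : collect every href attribute, dropping tags without one
[ [ "href" attribute ] map sift . ]

! //a//text() : for each <a>, join the text nodes found inside it
[
    [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map .
] bi
```

Swapping the literal string and parse-html for a url and scrape-html nip
should give the same flat vector to query, just from a real page.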


Cheers

On 2016-11-19 01:42, Björn Lindqvist wrote:
> I think the reason it is parsed into a vector of start and end tags is
> because that is much simpler when not all of the html data is available,
> or when you are dealing with broken html code. There is no real XPath
> support in any Factor vocab as far as I'm aware. I once wrote a
> half-completed binding for libxml2 (which has XPath support and a lot
> of other goodies) when I also needed it, but then I got side-tracked
> with other things I wanted to build. And the words in
> html.parser.analyzer were "good enough" for my use case. It's not so
> hard to use them to do the same kind of querying you would with XPath.
> 
> So for example, if you have the result of
> "https://news.ycombinator.com/" scrape-html nip on the stack:
> 
> //a//text() ->
> [ name>> "a" = ] find-between-all [ [ name>> text = ] filter
> [ text>> ] map " " join ] map
> 
> //@href ->
> [ "href" attribute ] map sift
> 
> //table[@class="itemlist"]/td[@class="storylink"]/(text() or @href) ->
> [ "itemlist" html-class? ] find-between-all first
> [ "storylink" html-class? ] find-between-all
> [ [ first "href" attribute ] [ second text>> ] bi 2array ] map
> 
> XPath expressions look better, but this works just fine.
> 
> 2016-11-19 0:32 GMT+01:00  <pet...@riseup.net>:
>> Hello again :)
>> 
>> I'm looking at the existing options for scraping web pages. I came
>> across this:
>> 
>> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>> 
>> but that's a json output and I'm looking at pages that only have html.
>> I see there's parse-html and scrape-html to parse a url into a vector,
>> which seems like an html tree flattened to an (event) stream. I'm left
>> to wonder about the choice as it is unusual to my eyes, but I found
>> there's a bunch of words working with the output in
>> html.parser.analyzer. I've fiddled around with it and found my way
>> around to extract some components I was looking for.
>> 
>> So now I'm wondering: is there anything else I've missed? Is there
>> something that parses html into a tree structure? Is there some simpler
>> DSL to extract data? The common cases I run into are XPath and CSS
>> selectors, which are short and to the point, but I'm fine with whatever
>> is easy enough and has the same power. So basically I'm just looking for
>> more tips or options in case I missed something. You guys have a lot of
>> vocabs :)
>> 
>> --
>> ------------
>>    Peter Nagy
>> ------------
>> 
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Factor-talk mailing list
>> Factor-talk@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/factor-talk

-- 
------------
   Peter Nagy
------------

