I think the reason it is parsed into a vector of start and end tags is
that a flat vector is much simpler to work with when not all of the
HTML data is available, or when you are dealing with broken HTML.
There is no real XPath support in any Factor vocab as far as I'm
aware. I once wrote a half-completed binding for libxml2 (which has
XPath support and a lot of other goodies) when I needed it too, but
then I got side-tracked by other things I wanted to build, and the
words in html.parser.analyzer were "good enough" for my use case. It's
not hard to use them to do the same kind of querying you would do
with XPath.

So for example, if you have the result of
"https://news.ycombinator.com/" scrape-html nip on the stack:

//a//text() ->
[ name>> "a" = ] find-between-all [ [ name>> text = ] filter
[ text>> ] map " " join ] map

//@href ->
[ "href" attribute ] map sift

//table[@class="itemlist"]//a[@class="storylink"]/(@href | text()) ->
[ "itemlist" html-class? ] find-between-all first
[ "storylink" html-class? ] find-between-all
[ [ first "href" attribute ] [ second text>> ] bi 2array ] map

XPath expressions look better, but this works just fine.
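
If you want to try this in the listener without hitting the network,
here is a minimal sketch that runs the first two queries on an inline
HTML string. The word names link-texts and hrefs and the sample markup
are made up for illustration; only parse-html, find-between-all,
attribute and the tag accessors come from the html.parser vocabs.

```factor
USING: accessors html.parser html.parser.analyzer kernel
prettyprint sequences ;
IN: scratchpad

! Hypothetical helper: roughly //a//text().
! Collects the text found between each <a> and its closing tag.
: link-texts ( vector -- seq )
    [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map ;

! Hypothetical helper: roughly //@href.
! Collects every href attribute, dropping tags that have none.
: hrefs ( vector -- seq )
    [ "href" attribute ] map sift ;

! Made-up sample markup, parsed into the flat tag vector.
"<p><a href=\"/x\">one</a> <a href=\"/y\">two</a></p>"
parse-html [ link-texts . ] [ hrefs . ] bi
```

Wrapping each query in a named word like this is about as close as you
get to a reusable "selector" with the current vocabs.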

2016-11-19 0:32 GMT+01:00  <pet...@riseup.net>:
> Hello again :)
>
> I'm looking at implemented options of scraping web pages? I've hit into
> this
>
> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>
> but that's a json output and I'm looking at pages that only have html. I
> see there's parse-html and scrape-html to parse a url into a vector,
> which seems like an html tree flattened to an (event) stream. I'm left
> to wonder about the choice as it is unusual to my eyes, but I found
> there's a bunch of words working with the output in
> html.parser.analyzer. I've fiddled around with it and found my way
> around to extract some components I was looking for.
>
> So now I'm wondering - is there anything else I've missed. Is there
> something that parses html into a tree structure? Is there some simpler
> DSL to extract data? The common cases I hit into are XPath and CSS
> selectors, which are short and to the point, but I'm fine with w/e that
> is easy enough and has the same power. So basically I'm just looking for
> more tips or options in case I missed something. You guys have a lot of
> vocabs :)
>
> --
> ------------
>    Peter Nagy
> ------------
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk



-- 
mvh/best regards Björn Lindqvist
