I think the reason it is parsed into a vector of start and end tags is that this is much simpler when not all of the HTML data is available, or when you are dealing with broken HTML. There is no real XPath support in any Factor vocab as far as I'm aware. I once wrote a half-completed binding for libxml2 (which has XPath support and a lot of other goodies) when I needed it too, but then I got sidetracked by other things I wanted to build, and the words in html.parser.analyzer were "good enough" for my use case. It's not so hard to use them to do the same kind of querying you would with XPath.
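To make the flat representation concrete, here is a minimal sketch that parses a small literal string instead of fetching a page. The HTML snippet is made up for illustration; the words are the ones from html.parser and html.parser.analyzer, and I have not run this against every Factor release.

```factor
USING: html.parser html.parser.analyzer kernel prettyprint
sequences ;

! A made-up literal document, so no network access is needed.
"<p>See <a href=\"/one\">one</a> and <a href=\"/two\">two</a>.</p>"
parse-html

! Print the flat vector: open tags, text nodes and close tags
! all sit side by side in document order.
dup .

! Collect every href attribute. Tags without one yield f,
! which sift then drops.
[ "href" attribute ] map sift .
```

Since nothing here depends on the tags being balanced, the same query still gives you something usable on truncated or malformed input.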
So for example, if you have the result of

    "https://news.ycombinator.com/" scrape-html nip

on the stack:

//a//text() ->

    [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map

//@href ->

    [ "href" attribute ] map sift

//table[@class="itemlist"]/td[@class="storylink"]/(text() or @href) ->

    [ "itemlist" html-class? ] find-between-all first
    [ "storylink" html-class? ] find-between-all
    [ [ first "href" attribute ] [ second text>> ] bi 2array ] map

XPath expressions look better, but this works just fine.

2016-11-19 0:32 GMT+01:00 <pet...@riseup.net>:
> Hello again :)
>
> I'm looking at implemented options of scraping web pages? I've hit into
> this
>
> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>
> but that's a json output and I'm looking at pages that only have html. I
> see there's parse-html and scrape-html to parse a url into a vector,
> which seems like an html tree flattened to an (event) stream. I'm left
> to wonder about the choice as it is unusual to my eyes, but I found
> there's a bunch of words working with the output in
> html.parser.analyzer. I've fiddled around with it and found my way
> around to extract some components I was looking for.
>
> So now I'm wondering - is there anything else I've missed. Is there
> something that parses html into a tree structure? Is there some simpler
> DSL to extract data? The common cases I hit into are XPath and CSS
> selectors, which are short and to the point, but I'm fine with w/e that
> is easy enough and has the same power. So basically I'm just looking for
> more tips or options in case I missed something.
> You guys have a lot of vocabs :)
>
> --
> ------------
> Peter Nagy
> ------------
>
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk

--
mvh/best regards
Björn Lindqvist