Thank you, Björn. I found html.parser.analyzer to be the same way: _good enough_.
Cheers

On 2016-11-19 01:42, Björn Lindqvist wrote:
> I think the reason it is parsed into a vector of start and end tags is
> because it is much simpler when not all of the html data is available.
> Or you are dealing with broken html code. There is no real XPath
> support in any Factor vocab as far as I'm aware of. I once wrote a
> half-completed binding for libxml2 (which has XPath support and a lot
> of other goodies) when I also needed it, but then I got side-tracked
> with other things I wanted to build. And the words in
> html.parser.analyzer were "good enough" for my use case. It's not so
> hard to use them to do the same kind of querying you would with XPath.
>
> So for example, if you have the result of
> "https://news.ycombinator.com/" scrape-html nip on the stack:
>
> //a//text() ->
>   [ name>> "a" = ] find-between-all [ [ name>> text = ] filter
>   [ text>> ] map " " join ] map
>
> //@href ->
>   [ "href" attribute ] map sift
>
> //table[@class="itemlist"]/td[@class="storylink"]/(text() or @href) ->
>   [ "itemlist" html-class? ] find-between-all first
>   [ "storylink" html-class? ] find-between-all
>   [ [ first "href" attribute ] [ second text>> ] bi 2array ] map
>
> XPath expressions look better, but this works just fine.
>
> 2016-11-19 0:32 GMT+01:00 <pet...@riseup.net>:
>> Hello again :)
>>
>> I'm looking at implemented options of scraping web pages? I've hit into
>> this
>>
>> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>>
>> but that's a json output and I'm looking at pages that only have html. I
>> see there's parse-html and scrape-html to parse a url into a vector,
>> which seems like an html tree flattened to an (event) stream. I'm left
>> to wonder about the choice as it is unusual to my eyes, but I found
>> there's a bunch of words working with the output in
>> html.parser.analyzer. I've fiddled around with it and found my way
>> around to extract some components I was looking for.
>>
>> So now I'm wondering - is there anything else I've missed. Is there
>> something that parses html into a tree structure? Is there some simpler
>> DSL to extract data? The common cases I hit into are XPath and CSS
>> selectors, which are short and to the point, but I'm fine with w/e that
>> is easy enough and has the same power. So basically I'm just looking for
>> more tips or options in case I missed something. You guys have a lot of
>> vocabs :)
>>
>> --
>> ------------
>> Peter Nagy
>> ------------
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Factor-talk mailing list
>> Factor-talk@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/factor-talk

--
------------
Peter Nagy
------------