Re: [Factor-talk] Web scraping
I think the reason it is parsed into a vector of start and end tags is
that this is much simpler when not all of the HTML data is available, or
when you are dealing with broken HTML. As far as I'm aware, there is no
real XPath support in any Factor vocab. I once wrote a half-completed
binding for libxml2 (which has XPath support and a lot of other goodies)
when I also needed it, but then I got side-tracked by other things I
wanted to build, and the words in html.parser.analyzer were "good
enough" for my use case.

It's not so hard to use them to do the same kind of querying you would
with XPath. So for example, if you have the result of

    "https://news.ycombinator.com/" scrape-html nip

on the stack:

//a//text() ->

    [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map

//@href ->

    [ "href" attribute ] map sift

//table[@class="itemlist"]/td[@class="storylink"]/(text() or @href) ->

    [ "itemlist" html-class? ] find-between-all first
    [ "storylink" html-class? ] find-between-all
    [ [ first "href" attribute ] [ second text>> ] bi 2array ] map

XPath expressions look better, but this works just fine.

2016-11-19 0:32 GMT+01:00:
> Hello again :)
>
> I'm looking at implemented options of scraping web pages? I've hit into
> this
>
> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>
> but that's a json output and I'm looking at pages that only have html. I
> see there's parse-html and scrape-html to parse a url into a vector,
> which seems like an html tree flattened to an (event) stream. I'm left
> to wonder about the choice as it is unusual to my eyes, but I found
> there's a bunch of words working with the output in
> html.parser.analyzer. I've fiddled around with it and found my way
> around to extract some components I was looking for.
>
> So now I'm wondering - is there anything else I've missed. Is there
> something that parses html into a tree structure? Is there some simpler
> DSL to extract data? The common cases I hit into are XPath and CSS
> selectors, which are short and to the point, but I'm fine with w/e that
> is easy enough and has the same power. So basically I'm just looking for
> more tips or options in case I missed something. You guys have a lot of
> vocabs :)
>
> --
> Peter Nagy

--
mvh/best regards
Björn Lindqvist

--
___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk
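For anyone who wants to try the queries above without hitting a live page, here is a minimal sketch that runs two of them on a literal HTML string. It is not from the original message: the vocab name scraping-sketch and the words hrefs and link-texts are hypothetical, and it assumes parse-html, find-between-all and attribute behave as described in the reply.

```factor
USING: accessors arrays html.parser html.parser.analyzer kernel
sequences ;
IN: scraping-sketch

! Hypothetical helper, roughly //@href:
! every href attribute value found in an HTML string.
: hrefs ( html -- seq )
    parse-html [ "href" attribute ] map sift >array ;

! Hypothetical helper, roughly //a//text():
! the text found between each <a> and its closing tag.
: link-texts ( html -- seq )
    parse-html [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map
    >array ;
```

Both words take the HTML as a string, so the same quotations can be checked on small hand-written inputs before pointing them at the output of scrape-html.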
[Factor-talk] Web scraping
Hello again :)

I'm looking at implemented options of scraping web pages? I've hit into
this

http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html

but that's a json output and I'm looking at pages that only have html. I
see there's parse-html and scrape-html to parse a url into a vector,
which seems like an html tree flattened to an (event) stream. I'm left
to wonder about the choice as it is unusual to my eyes, but I found
there's a bunch of words working with the output in
html.parser.analyzer. I've fiddled around with it and found my way
around to extract some components I was looking for.

So now I'm wondering - is there anything else I've missed. Is there
something that parses html into a tree structure? Is there some simpler
DSL to extract data? The common cases I hit into are XPath and CSS
selectors, which are short and to the point, but I'm fine with w/e that
is easy enough and has the same power. So basically I'm just looking for
more tips or options in case I missed something. You guys have a lot of
vocabs :)

--
Peter Nagy
Re: [Factor-talk] Dedupe by Slot
The *.extras vocabularies are places to incubate new words. We haven't
done the best job of documenting them and promoting them to the
core/basis vocabularies.

> On Nov 18, 2016, at 8:21 AM, Alexander Ilin wrote:
>
> Hello, Björn!
>
> 18.11.2016, 18:25, "Björn Lindqvist":
>> USE: sequences.extras
>> [ id>> ] sort-with [ id>> ] group-by [ second first ] map
>
> I could not find `group-by` using the Browser. Grepping the source tree, it
> turned up in `grouping.extras`.
>
>> USE: math.statistics
>> [ id>> ] collect-by [ nip first ] { } assoc>map
>
> `collect-by` is a useful thing, got to keep it in mind. I remember
> implementing something very similar not too long ago.
>
>> It's not as efficient as what John committed though. :) Maybe we
>> should try and clean it up somehow? If we put all group
>> by/aggregation/uniquifying words in the same vocab it would be more
>> easily discoverable?
>
> That may be a good idea. I'm regularly rereading the documentation for
> `sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at
> least *.extras) could be made.
>
> ---=---
> Александр
Re: [Factor-talk] Dedupe by Slot
Hello, Björn!

18.11.2016, 18:25, "Björn Lindqvist":
> USE: sequences.extras
> [ id>> ] sort-with [ id>> ] group-by [ second first ] map

I could not find `group-by` using the Browser. Grepping the source tree, it
turned up in `grouping.extras`.

> USE: math.statistics
> [ id>> ] collect-by [ nip first ] { } assoc>map

`collect-by` is a useful thing, got to keep it in mind. I remember
implementing something very similar not too long ago.

> It's not as efficient as what John committed though. :) Maybe we
> should try and clean it up somehow? If we put all group
> by/aggregation/uniquifying words in the same vocab it would be more
> easily discoverable?

That may be a good idea. I'm regularly rereading the documentation for
`sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at
least *.extras) could be made.

---=---
Александр
Re: [Factor-talk] Dedupe by Slot
John, thank you very much! :) Really helpful stuff!

18.11.2016, 18:13, "John Benediktsson":
> P.S., Hah I should have called it unique-by, it's too early in the
> morning!
>
> P.P.S., I committed this word into sets.extras, with one small change
> besides the name, which is to size the hash-set capacity by the length
> of the sequence.
>
> On Fri, Nov 18, 2016 at 6:54 AM, John Benediktsson wrote:
>> Maybe something like this:
>>
>>     : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
>>         HS{ } clone '[ @ _ ?adjoin ] filter ; inline
>>
>> Then you can use it:
>>
>>     IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
>>     { 1 2 4 }
>>
>>     IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by
>>
>> It would keep the first element that matches by key and drop all the
>> subsequent ones.
>>
>> On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin wrote:
>>> Hello, all!
>>>
>>> I have an interesting little task for you today.
>>>
>>> Let's say you have a sequence of tuples, and you want to remove all
>>> tuples with duplicate ids, so that in the new sequence there is only
>>> one tuple with each id.
>>>
>>> Here's my solution:
>>>
>>> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>>>     dup [ hash>> ] map >hash-set [
>>>         [ hash>> ] dip
>>>         [ in? ] [ delete ] 2bi
>>>     ] curry filter ;
>>>
>>> This is not the first time I'm solving this task, and I began to
>>> wonder - is there something similar in the Factor library?
>>>
>>> Is this the simplest/most efficient implementation?
>>>
>>> Is it possible to generalize it to work for any slot like so:
>>>
>>> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>>>
>>> If this code is not in the standard library, how about adding it?
>>> Seems pretty useful, and not too trivial.
>>>
>>> What do you say?
>>>
>>> ---=---
>>> Александр

---=---
Александр
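The change mentioned in the P.P.S. (sizing the hash-set by the length of the sequence) might look roughly like the sketch below. This is an assumption about the shape of the word, not a copy of what was committed to sets.extras, and the vocab name unique-by-sketch is hypothetical.

```factor
USING: fry hash-sets kernel math sequences sets ;
IN: unique-by-sketch

! Keep the first element seen for each key and drop later ones.
! The hash-set is created with a capacity equal to the length of
! the input, instead of starting from an empty HS{ } clone.
: unique-by ( seq quot: ( elt -- key ) -- seq' )
    over length <hash-set> '[ @ _ ?adjoin ] filter ; inline
```

With this definition, `{ 1 2 3 4 5 } [ 2/ ] unique-by` leaves `{ 1 2 4 }` on the stack, the same result as the `duplicates-by` example quoted above.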
Re: [Factor-talk] Dedupe by Slot
2016-11-18 15:36 GMT+01:00 Alexander Ilin:
> Hello, all!
>
> I have an interesting little task for you today.
>
> Let's say you have a sequence of tuples, and you want to remove all tuples
> with duplicate ids, so that in the new sequence there is only one tuple with
> each id.
>
> Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
> This is not the first time I'm solving this task, and I began to wonder -
> is there something similar in the Factor library?

Everything is in the Factor library. :) What you are describing is like a
GROUP BY operation in SQL. So if you have:

    TUPLE: person name id ;

You can use either:

    USE: sequences.extras
    [ id>> ] sort-with [ id>> ] group-by [ second first ] map

Or

    USE: math.statistics
    [ id>> ] collect-by [ nip first ] { } assoc>map

If you want tiebreakers, like choosing the person with the alphabetically
first name if more than one share an id, you can implement it like this:

    USE: slots.syntax
    [ slots{ id name } ] sort-with [ id>> ] group-by [ second first ] map

It's not as efficient as what John committed though. :) Maybe we
should try and clean it up somehow? If we put all group
by/aggregation/uniquifying words in the same vocab it would be more
easily discoverable?

--
mvh
Björn Lindqvist
Re: [Factor-talk] Dedupe by Slot
Maybe something like this:

    : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
        HS{ } clone '[ @ _ ?adjoin ] filter ; inline

Then you can use it:

    IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
    { 1 2 4 }

    IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by

It would keep the first element that matches by key and drop all the
subsequent ones.

On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin wrote:
> Hello, all!
>
> I have an interesting little task for you today.
>
> Let's say you have a sequence of tuples, and you want to remove all
> tuples with duplicate ids, so that in the new sequence there is only one
> tuple with each id.
>
> Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
> This is not the first time I'm solving this task, and I began to wonder
> - is there something similar in the Factor library?
>
> Is this the simplest/most efficient implementation?
>
> Is it possible to generalize it to work for any slot like so:
>
> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>
> If this code is not in the standard library, how about adding it? Seems
> pretty useful, and not too trivial.
>
> What do you say?
>
> ---=---
> Александр
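To make the "keeps the first element per key" behaviour concrete for tuples, here is a small self-contained sketch built around the word from the message above. The vocab name dedupe-sketch, the person tuple and the sample-people word are illustrative assumptions, not part of the thread's committed code.

```factor
USING: accessors arrays fry hash-sets kernel sequences sets ;
IN: dedupe-sketch

TUPLE: person name id ;

! The word as posted: filter with a hash-set of keys seen so far.
: duplicates-by ( seq quot: ( elt -- key ) -- seq' )
    HS{ } clone '[ @ _ ?adjoin ] filter ; inline

! Hypothetical sample data: Ann and Bob share id 1.
: sample-people ( -- seq )
    "Ann" 1 person boa
    "Bob" 1 person boa
    "Cy" 2 person boa 3array ;

! Names of the people that survive deduplication by id.
: unique-names ( -- names )
    sample-people [ id>> ] duplicates-by [ name>> ] map ;
```

Calling `unique-names` gives `{ "Ann" "Cy" }`: Bob is dropped because Ann was seen first with the same id. Passing a different accessor quotation, such as `[ name>> ]`, dedupes by that slot instead, which covers the `dedupe-by-slot` question without a separate word.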
[Factor-talk] Dedupe by Slot
Hello, all!

I have an interesting little task for you today.

Let's say you have a sequence of tuples, and you want to remove all
tuples with duplicate ids, so that in the new sequence there is only one
tuple with each id.

Here's my solution:

TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
    dup [ hash>> ] map >hash-set [
        [ hash>> ] dip
        [ in? ] [ delete ] 2bi
    ] curry filter ;

This is not the first time I'm solving this task, and I began to wonder
- is there something similar in the Factor library?

Is this the simplest/most efficient implementation?

Is it possible to generalize it to work for any slot like so:

TYPED: dedupe-by-slot ( seq slot -- seq ) ?

If this code is not in the standard library, how about adding it? Seems
pretty useful, and not too trivial.

What do you say?

---=---
Александр