Re: [Factor-talk] Web scraping

2016-11-18 Thread Björn Lindqvist
I think the reason it is parsed into a vector of start and end tags is
that it is much simpler when not all of the html data is available, or
when you are dealing with broken html code. There is no real XPath
support in any Factor vocab as far as I'm aware. I once wrote a
half-completed binding for libxml2 (which has XPath support and a lot
of other goodies) when I also needed it, but then I got side-tracked
with other things I wanted to build. And the words in
html.parser.analyzer were "good enough" for my use case. It's not so
hard to use them to do the same kind of querying you would with XPath.

So for example, if you have the result of
"https://news.ycombinator.com/" scrape-html nip on the stack:

//a//text() ->
[ name>> "a" = ] find-between-all [ [ name>> text = ] filter
[ text>> ] map " " join ] map

//@href ->
[ "href" attribute ] map sift

//table[@class="itemlist"]/td[@class="storylink"]/(text() or @href) ->
[ "itemlist" html-class? ] find-between-all first
[ "storylink" html-class? ] find-between-all
[ [ first "href" attribute ] [ second text>> ] bi 2array ] map

XPath expressions look better, but this works just fine.
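
As a rough sketch, the first two queries above could be wrapped as
reusable words. The names all-hrefs and link-texts are made up for this
example and are not part of any Factor vocab; the bodies are the
snippets above unchanged. The USING: line assumes that attribute and
find-between-all come from html.parser.analyzer and that the text
symbol comes from html.parser:

```factor
USING: accessors html.parser html.parser.analyzer kernel
sequences ;
IN: scratchpad

! Hypothetical helper words; the names are invented for this sketch.

! //@href: every href attribute value found in the tag vector.
: all-hrefs ( vector -- seq )
    [ "href" attribute ] map sift ;

! //a//text(): the text found inside each <a> ... </a> span,
! joined with spaces, one string per link.
: link-texts ( vector -- seq )
    [ name>> "a" = ] find-between-all
    [ [ name>> text = ] filter [ text>> ] map " " join ] map ;
```

With those defined, "https://news.ycombinator.com/" scrape-html nip
all-hrefs would leave the list of link targets on the stack.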

2016-11-19 0:32 GMT+01:00  :
> Hello again :)
>
> I'm looking at the existing options for scraping web pages. I came
> across this
>
> http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html
>
> but that's a json output and I'm looking at pages that only have html. I
> see there's parse-html and scrape-html to parse a url into a vector,
> which seems like an html tree flattened to an (event) stream. I'm left
> to wonder about the choice as it is unusual to my eyes, but I found
> there's a bunch of words working with the output in
> html.parser.analyzer. I've fiddled around with it and found my way
> around to extract some components I was looking for.
>
> So now I'm wondering: is there anything else I've missed? Is there
> something that parses html into a tree structure? Is there some simpler
> DSL to extract data? The common cases I run into are XPath and CSS
> selectors, which are short and to the point, but I'm fine with whatever
> is easy enough and has the same power. So basically I'm just looking for
> more tips or options in case I missed something. You guys have a lot of
> vocabs :)
>
> --
> 
>Peter Nagy
> 
>
> --
> ___
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk



-- 
mvh/best regards Björn Lindqvist

--
___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk


[Factor-talk] Web scraping

2016-11-18 Thread petern
Hello again :)

I'm looking at the existing options for scraping web pages. I came
across this

http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html

but that's a json output and I'm looking at pages that only have html. I 
see there's parse-html and scrape-html to parse a url into a vector, 
which seems like an html tree flattened to an (event) stream. I'm left 
to wonder about the choice as it is unusual to my eyes, but I found 
there's a bunch of words working with the output in 
html.parser.analyzer. I've fiddled around with it and found my way 
around to extract some components I was looking for.

So now I'm wondering: is there anything else I've missed? Is there
something that parses html into a tree structure? Is there some simpler
DSL to extract data? The common cases I run into are XPath and CSS
selectors, which are short and to the point, but I'm fine with whatever
is easy enough and has the same power. So basically I'm just looking for
more tips or options in case I missed something. You guys have a lot of
vocabs :)

-- 

   Peter Nagy




Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread John Benediktsson
The *.extras vocabularies are places to incubate new words. We haven't done the
best job of documenting them and promoting them to the core/basis vocabularies.

> On Nov 18, 2016, at 8:21 AM, Alexander Ilin  wrote:
> 
> Hello, Björn!
> 
> 18.11.2016, 18:25, "Björn Lindqvist" :
>> USE: sequences.extras
>> [ id>> ] sort-with [ id>> ] group-by [ second first ] map
> 
>  I could not find `group-by` using the Browser. Grepping the source tree, it 
> turned up in `grouping.extras`.
> 
>> USE: math.statistics
>> [ id>> ] collect-by [ nip first ] { } assoc>map
> 
>  `collect-by` is a useful thing, got to keep it in mind. I remember 
> implementing something very similar not too long ago.
> 
>> It's not as efficient as what John committed though. :) Maybe we
>> should try and clean it up somehow? If we put all group
>> by/aggregation/uniquifying words in the same vocab it would be more
>> easily discoverable?
> 
>  That may be a good idea. I'm regularly rereading the documentation for 
> `sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at 
> least *.extras) could be made.
> 
> ---=--- 
> Александр
> 



Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Alexander Ilin
Hello, Björn!

18.11.2016, 18:25, "Björn Lindqvist" :
> USE: sequences.extras
> [ id>> ] sort-with [ id>> ] group-by [ second first ] map

  I could not find `group-by` using the Browser. Grepping the source tree, it 
turned up in `grouping.extras`.

> USE: math.statistics
> [ id>> ] collect-by [ nip first ] { } assoc>map

  `collect-by` is a useful thing, got to keep it in mind. I remember 
implementing something very similar not too long ago.

> It's not as efficient as what John committed though. :) Maybe we
> should try and clean it up somehow? If we put all group
> by/aggregation/uniquifying words in the same vocab it would be more
> easily discoverable?

  That may be a good idea. I'm regularly rereading the documentation for 
`sequences` and `sets`, so maybe a little pointer to adjacent vocabs (at least 
*.extras) could be made.

---=--- 
 Александр



Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Alexander Ilin
John, thank you very much! : )

Really helpful stuff!

18.11.2016, 18:13, "John Benediktsson" :
> P.S., Hah I should have called it unique-by, it's too early in the
> morning!
>
> P.P.S., I committed this word into sets.extras, with one small change
> besides the name which is to size the hash-set capacity by the length
> of the sequence.
>
> On Fri, Nov 18, 2016 at 6:54 AM, John Benediktsson  wrote:
>> Maybe something like this:
>>
>> : duplicates-by ( seq quot: ( elt -- key ) -- seq' )
>>     HS{ } clone '[ @ _ ?adjoin ] filter ; inline
>>
>> Then you can use it:
>>
>>     IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
>>     { 1 2 4 }
>>
>>     IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by
>>
>> It would keep the first element that matches by key and drop all the
>> subsequent ones.
>>
>> On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin  wrote:
>>> Hello, all!
>>>
>>>   I have an interesting little task for you today.
>>>
>>>   Let's say you have a sequence of tuples, and you want to remove all
>>> tuples with duplicate ids, so that in the new sequence there is only
>>> one tuple with each id.
>>>
>>>   Here's my solution:
>>>
>>> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>>>     dup [ hash>> ] map >hash-set [
>>>         [ hash>> ] dip
>>>         [ in? ] [ delete ] 2bi
>>>     ] curry filter ;
>>>
>>>   This is not the first time I'm solving this task, and I began to
>>> wonder - is there something similar in the Factor library?
>>>
>>>   Is this the simplest/most efficient implementation?
>>>
>>>   Is it possible to generalize it to work for any slot like so:
>>>
>>> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>>>
>>>   If this code is not in the standard library, how about adding it?
>>> Seems pretty useful, and not too trivial.
>>>
>>>   What do you say?
>>>
>>> ---=---
>>> Александр

---=---
Александр


Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread Björn Lindqvist
2016-11-18 15:36 GMT+01:00 Alexander Ilin :
> Hello, all!
>
>   I have an interesting little task for you today.
>
>   Let's say you have a sequence of tuples, and you want to remove all tuples 
> with duplicate ids, so that in the new sequence there is only one tuple with 
> each id.
>
>   Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
>   This is not the first time I'm solving this task, and I began to wonder - 
> is there something similar in the Factor library?

Everything is in the Factor library. :) What you are describing is
like a GROUP BY operation in SQL. So if you have:

TUPLE: person name id ;

You can use either:

USE: sequences.extras
[ id>> ] sort-with [ id>> ] group-by [ second first ] map

Or

USE: math.statistics
[ id>> ] collect-by [ nip first ] { } assoc>map

If you want tiebreakers, like choosing the person with the
alphabetically first name if more than one shares an id, you can
implement it like this:

USE: slots.syntax
[ slots{ id name } ] sort-with [ id>> ] group-by [ second first ] map

It's not as efficient as what John committed though. :) Maybe we
should try and clean it up somehow? If we put all group
by/aggregation/uniquifying words in the same vocab it would be more
easily discoverable?
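
For reference, here is a sketch of what the committed word might look
like, going by John's description elsewhere in this thread (renamed to
unique-by, with the hash set sized by the length of the sequence). It
is a reconstruction from that description, not a copy of the
sets.extras source, and the USING: line assumes a Factor build where
the fry syntax still lives in the fry vocab:

```factor
USING: fry hash-sets kernel math prettyprint sequences sets ;
IN: scratchpad

! Keep the first element seen for each key and drop later ones.
! The hash set is created with capacity for `seq length` keys,
! so it should not need to grow while filtering.
: unique-by ( seq quot: ( elt -- key ) -- seq' )
    over length <hash-set> '[ @ _ ?adjoin ] filter ; inline

{ 1 2 3 4 5 } [ 2/ ] unique-by .
! prints { 1 2 4 }
```

This makes a single pass with one hash lookup per element, whereas the
sort-with and group-by version above has to sort the sequence first.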


--
mvh Björn Lindqvist



Re: [Factor-talk] Dedupe by Slot

2016-11-18 Thread John Benediktsson
Maybe something like this:

: duplicates-by ( seq quot: ( elt -- key ) -- seq' )
    HS{ } clone '[ @ _ ?adjoin ] filter ; inline

Then you can use it:

IN: scratchpad { 1 2 3 4 5 } [ 2/ ] duplicates-by
{ 1 2 4 }

IN: scratchpad sequence-of-tuples [ hash>> ] duplicates-by

It would keep the first element that matches by key and drop all the
subsequent ones.
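
To make the "keeps the first element" behaviour concrete, here is a
small self-contained sketch that applies the word above to tuples. The
person tuple and the sample names and ids are made up for illustration,
and the USING: line assumes a Factor build where the fry syntax still
lives in the fry vocab:

```factor
USING: accessors arrays fry hash-sets kernel prettyprint
sequences sets ;
IN: scratchpad

! Example tuple; the names and ids below are invented.
TUPLE: person name id ;

: duplicates-by ( seq quot: ( elt -- key ) -- seq' )
    HS{ } clone '[ @ _ ?adjoin ] filter ; inline

! Build three people; "Ann" and "Cid" share id 1.
"Ann" 1 person boa
"Bob" 2 person boa
"Cid" 1 person boa 3array

! Dedupe by id, then show which names survived.
[ id>> ] duplicates-by [ name>> ] map .
! prints { "Ann" "Bob" }
```

Because the key is computed by a quotation, the same word covers the
"any slot" case: pass [ hash>> ], [ id>> ], or any other accessor.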



On Fri, Nov 18, 2016 at 6:36 AM, Alexander Ilin  wrote:

> Hello, all!
>
>   I have an interesting little task for you today.
>
>   Let's say you have a sequence of tuples, and you want to remove all
> tuples with duplicate ids, so that in the new sequence there is only one
> tuple with each id.
>
>   Here's my solution:
>
> TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
>     dup [ hash>> ] map >hash-set [
>         [ hash>> ] dip
>         [ in? ] [ delete ] 2bi
>     ] curry filter ;
>
>   This is not the first time I'm solving this task, and I began to wonder
> - is there something similar in the Factor library?
>
>   Is this the simplest/most efficient implementation?
>
>   Is it possible to generalize it to work for any slot like so:
>
> TYPED: dedupe-by-slot ( seq slot -- seq ) ?
>
>   If this code is not in the standard library, how about adding it? Seems
> pretty useful, and not too trivial.
>
>   What do you say?
>
> ---=---
>  Александр
>
> 


[Factor-talk] Dedupe by Slot

2016-11-18 Thread Alexander Ilin
Hello, all!

  I have an interesting little task for you today.

  Let's say you have a sequence of tuples, and you want to remove all tuples 
with duplicate ids, so that in the new sequence there is only one tuple with 
each id.

  Here's my solution:

TYPED: dedupe-by-hash ( seq: sequence -- seq: sequence )
    dup [ hash>> ] map >hash-set [
        [ hash>> ] dip
        [ in? ] [ delete ] 2bi
    ] curry filter ;

  This is not the first time I'm solving this task, and I began to wonder - is 
there something similar in the Factor library?

  Is this the simplest/most efficient implementation?

  Is it possible to generalize it to work for any slot like so:

TYPED: dedupe-by-slot ( seq slot -- seq ) ?

  If this code is not in the standard library, how about adding it? Seems 
pretty useful, and not too trivial.

  What do you say?

---=--- 
 Александр
