Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2011-03-20 Thread Gwern Branwen
On Fri, Apr 30, 2010 at 12:02 PM, Gwern Branwen wrote:
> On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
>> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>>> Nothing in http://develop.github.com/ seems especially useful for
>>> grabbing the git:// URLs of all repos by language - just by user.
>>>
>>> The only real list of repos by language seems to be available via
>>> http://github.com/languages/Haskell/updated or
>>> http://github.com/languages/Haskell/created . (You might think
>>> http://github.com/languages/Haskell would be good, but no, it's just a
>>> few random repos by interest and not a full listing.)
>>
>> GitHub has a REST API for accessing data. Unfortunately it can't give
>> you the breakdown you want, but I would ask them for it. It is much
>> simpler for you,
>
> You mean ask for a new feature? (Just a one-time list is no good since
> I intend to repeat it regularly to pick up new repos, just like with
> patch-tag.)
>
>> and it does not put an extra strain on their servers due to the
>> scraping.
>
> Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
> downloading of the repos would probably reduce that demand to
> insignificance, especially the first time around when most of the
> repos would need to be downloaded.
>
>> Usually, the github guys are helpful when you have a
>> question.

Ultimately, they never did anything about it:
http://support.github.com/discussions/email/6782-contact-extending-api-to-easily-get-list-of-repos-by-language

So I wrote a TagSoup scraper; then I wrote a long tutorial explaining
how I wrote it, step by step.

1. my tutorial: http://www.gwern.net/haskell/Archiving%20GitHub.html
2. the script itself:
http://www.gwern.net/haskell/Archiving%20GitHub.html#the-script
3. Reddit submission of #1 for those who prefer to comment there:
http://www.reddit.com/r/haskell/comments/g7na5/writing_a_haskell_script_to_download_github/

(While writing the tutorial, I tweaked the script code, so I'm not
100% confident that it still works; it uses too much GitHub bandwidth
(and local disk space) for me to re-run it just to check. So if anyone
does run it, I would appreciate hearing how it goes.)

-- 
gwern
http://www.gwern.net

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Jesper Louis Andersen
On Fri, Apr 30, 2010 at 6:02 PM, Gwern Branwen wrote:
>>
>> GitHub has a REST API for accessing data. Unfortunately it can't give
>> you the breakdown you want, but I would ask them for it. It is much
>> simpler for you,
>
> You mean ask for a new feature? (Just a one-time list is no good since
> I intend to repeat it regularly to pick up new repos, just like with
> patch-tag.)
>

Yes.

> Any suggested method besides the obvious http://github.com/contact ?

No.


-- 
J.


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Gwern Branwen
On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>> Nothing in http://develop.github.com/ seems especially useful for
>> grabbing the git:// URLs of all repos by language - just by user.
>>
>> The only real list of repos by language seems to be available via
>> http://github.com/languages/Haskell/updated or
>> http://github.com/languages/Haskell/created . (You might think
>> http://github.com/languages/Haskell would be good, but no, it's just a
>> few random repos by interest and not a full listing.)
>
> GitHub has a REST API for accessing data. Unfortunately it can't give
> you the breakdown you want, but I would ask them for it. It is much
> simpler for you,

You mean ask for a new feature? (Just a one-time list is no good since
I intend to repeat it regularly to pick up new repos, just like with
patch-tag.)

> and it does not put an extra strain on their servers due to the
> scraping.

Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
downloading of the repos would probably reduce that demand to
insignificance, especially the first time around when most of the
repos would need to be downloaded.
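
Spelled out, the estimate is one hit per listing page plus one hit per
repo page. A minimal Haskell sketch of that arithmetic, assuming the 20
is the number of repos shown on each listing page (the names below are
made up for illustration):

```haskell
-- Back-of-the-envelope count of HTTP requests for one full scrape.
listingPages :: Int
listingPages = 98

-- Assumption: the "20" in the estimate is the number of repos per listing page.
reposPerPage :: Int
reposPerPage = 20

-- One request per listing page, plus one request per repo page.
totalHits :: Int
totalHits = listingPages + reposPerPage * listingPages

main :: IO ()
main = print totalHits  -- 98 + 1960 = 2058
```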

> Usually, the github guys are helpful when you have a
> question.

Any suggested method besides the obvious http://github.com/contact ?

-- 
gwern


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Jesper Louis Andersen
On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
> Nothing in http://develop.github.com/ seems especially useful for
> grabbing the git:// URLs of all repos by language - just by user.
>
> The only real list of repos by language seems to be available via
> http://github.com/languages/Haskell/updated or
> http://github.com/languages/Haskell/created . (You might think
> http://github.com/languages/Haskell would be good, but no, it's just a
> few random repos by interest and not a full listing.)

GitHub has a REST API for accessing data. Unfortunately it can't give
you the breakdown you want, but I would ask them for it. It is much
simpler for you,
and it does not put an extra strain on their servers due to the
scraping. Usually, the github guys are helpful when you have a
question.

-- 
J.


[Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Gwern Branwen
Along the lines of
http://blog.patch-tag.com/2010/03/13/mirroring-patch-tag/ for
downloading all patch-tag.com repositories, I've begun to wonder how
to download all GitHub repositories, since more and more people seem to
be using it.

Nothing in http://develop.github.com/ seems especially useful for
grabbing the git:// URLs of all repos by language - just by user.

The only real list of repos by language seems to be available via
http://github.com/languages/Haskell/updated or
http://github.com/languages/Haskell/created . (You might think
http://github.com/languages/Haskell would be good, but no, it's just a
few random repos by interest and not a full listing.)

I looked at the HTML, and it looks possible to use tagsoup to get all
98 pages and then parse the entries to get the HTTP URLs of the repos,
and then turn *that* into git:// URLs suitable for shelling out to
'git clone', but I can't help but wonder if maybe there's a better
approach someone more familiar with GitHub would know.
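
The step of turning HTTP URLs into git:// URLs can be sketched without
touching the network. This is a rough sketch under stated assumptions,
not a tested scraper: it assumes repo pages look like
http://github.com/USER/REPO and that the matching clone URL is
git://github.com/USER/REPO.git. The names toGitUrl and splitOn are
invented here, and fetching and parsing the listing pages with tagsoup
is left out entirely.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- | Split a string on a separator character, keeping empty chunks.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest

-- | Turn a repo page URL into a git:// clone URL.
-- Assumption: repo pages have exactly the shape http://github.com/USER/REPO.
-- Anything else (other hosts, extra path segments, missing repo) gives Nothing.
toGitUrl :: String -> Maybe String
toGitUrl url = do
  path <- stripPrefix "http://github.com/" url
  case splitOn '/' path of
    [user, repo] | not (null user), not (null repo) ->
      Just ("git://github.com/" ++ user ++ "/" ++ repo ++ ".git")
    _ -> Nothing

main :: IO ()
main = mapM_ putStrLn (mapMaybe toGitUrl sampleUrls)
  where
    -- Hypothetical inputs; a real run would take these from the parsed listing pages.
    sampleUrls =
      [ "http://github.com/someuser/somerepo"
      , "http://github.com/someuser"
      , "http://example.org/not/github"
      ]
```

Each resulting URL could then be handed to 'git clone', for example
through System.Process. Note that this shape check alone would also
accept site pages such as the language listings themselves, so a real
script would need to filter those out.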

-- 
gwern