Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2011-03-20 Thread Gwern Branwen
On Fri, Apr 30, 2010 at 12:02 PM, Gwern Branwen gwe...@gmail.com wrote:
 On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen
 jesper.louis.ander...@gmail.com wrote:
 On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen gwe...@gmail.com wrote:
 Nothing in http://develop.github.com/ seems especially useful for
 grabbing the git:// URLs of all repos by language - just by user.

 The only real list of repos by language seems to be gotten at via
 http://github.com/languages/Haskell/updated or
 http://github.com/languages/Haskell/created . (You might think
 http://github.com/languages/Haskell would be good, but no, it's just a
 few random repos by interest and not a full listing.)

 Github has a REST API for accessing data. Unfortunately it can't give
 you the wanted breakdown, but I would ask them for it. It is much
 simpler for you,

 You mean ask for a new feature? (Just a one-time list is no good since
 I intend to repeat it regularly to pick up new repos, just like with
 patch-tag.)

 and it does not put an extra strain on their servers due to the
 scraping.

 Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
 downloading of the repos would probably reduce that demand to
 insignificance, especially the first time around when most of the
 repos would need to be downloaded.

 Usually, the github guys are helpful when you have a
 question.

Ultimately, they never did anything about it:
http://support.github.com/discussions/email/6782-contact-extending-api-to-easily-get-list-of-repos-by-language

So I wrote a TagSoup scraper; then I wrote a long tutorial explaining
how I wrote it, step by step.

1. my tutorial: http://www.gwern.net/haskell/Archiving%20GitHub.html
2. the script itself:
http://www.gwern.net/haskell/Archiving%20GitHub.html#the-script
3. Reddit submission of #1 for those who prefer to comment there:
http://www.reddit.com/r/haskell/comments/g7na5/writing_a_haskell_script_to_download_github/

(While writing the tutorial, I tweaked the script code, so I'm not
100% confident that it still works; it uses too much GitHub bandwidth
(and local disk space) for me to re-run it just to check. So if anyone
does run it, I would appreciate hearing whether it works.)

-- 
gwern
http://www.gwern.net

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Gwern Branwen
Along the lines of
http://blog.patch-tag.com/2010/03/13/mirroring-patch-tag/ for
downloading all patch-tag.com repositories, I've begun to wonder how
to download all Github repositories since more and more people seem to
be using it.

Nothing in http://develop.github.com/ seems especially useful for
grabbing the git:// URLs of all repos by language - just by user.

The only real list of repos by language seems to be gotten at via
http://github.com/languages/Haskell/updated or
http://github.com/languages/Haskell/created . (You might think
http://github.com/languages/Haskell would be good, but no, it's just a
few random repos by interest and not a full listing.)

I looked at the HTML, and it looks possible to use tagsoup to get all
98 pages and then parse the entries to get the HTTP URLs of the repos,
and then turn *that* into git:// URLs suitable for shelling out to
'git clone', but I can't help but wonder if maybe there's a better
approach someone more familiar with Github would know.
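
A minimal sketch of the URL-handling half of that plan, using only base.
It leaves out the fetching and the tagsoup parsing. The "?page=N" query
parameter, the ".git" suffix, and the example-user/example-repo name are
assumptions made for illustration, not something confirmed against
GitHub's pages.

```haskell
import Data.List (stripPrefix)

-- Number of listing pages mentioned above.
pageCount :: Int
pageCount = 98

-- ASSUMPTION: listing pages are selected with a "?page=N" parameter.
listingPages :: [String]
listingPages =
  [ "http://github.com/languages/Haskell/updated?page=" ++ show n
  | n <- [1 .. pageCount]
  ]

-- Turn an HTTP repo URL into a git:// URL.
-- Returns Nothing when the input does not start with "http://".
-- ASSUMPTION: the clone URL is the same host/path plus ".git".
toGitUrl :: String -> Maybe String
toGitUrl url = do
  rest <- stripPrefix "http://" url
  return ("git://" ++ rest ++ ".git")

-- Arguments one would pass to the git executable for a clone.
cloneArgs :: String -> [String]
cloneArgs gitUrl = ["clone", gitUrl]

main :: IO ()
main = do
  -- Show the first two listing pages that would be fetched.
  mapM_ putStrLn (take 2 listingPages)
  -- Hypothetical repo URL, for illustration only.
  let repo = "http://github.com/example-user/example-repo"
  print (toGitUrl repo)
  print (fmap cloneArgs (toGitUrl repo))
```

The argument list from cloneArgs could be handed to something like
System.Process.rawSystem "git", which avoids building a shell string by
hand.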

-- 
gwern


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Jesper Louis Andersen
On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen gwe...@gmail.com wrote:
 Nothing in http://develop.github.com/ seems especially useful for
 grabbing the git:// URLs of all repos by language - just by user.

 The only real list of repos by language seems to be gotten at via
 http://github.com/languages/Haskell/updated or
 http://github.com/languages/Haskell/created . (You might think
 http://github.com/languages/Haskell would be good, but no, it's just a
 few random repos by interest and not a full listing.)

Github has a REST API for accessing data. Unfortunately it can't give
you the wanted breakdown, but I would ask them for it. It is much
simpler for you, and it does not put an extra strain on their servers
due to the scraping. Usually, the github guys are helpful when you have
a question.

-- 
J.


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Gwern Branwen
On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen
jesper.louis.ander...@gmail.com wrote:
 On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen gwe...@gmail.com wrote:
 Nothing in http://develop.github.com/ seems especially useful for
 grabbing the git:// URLs of all repos by language - just by user.

 The only real list of repos by language seems to be gotten at via
 http://github.com/languages/Haskell/updated or
 http://github.com/languages/Haskell/created . (You might think
 http://github.com/languages/Haskell would be good, but no, it's just a
 few random repos by interest and not a full listing.)

 Github has a REST API for accessing data. Unfortunately it can't give
 you the wanted breakdown, but I would ask them for it. It is much
 simpler for you,

You mean ask for a new feature? (Just a one-time list is no good since
I intend to repeat it regularly to pick up new repos, just like with
patch-tag.)

 and it does not put an extra strain on their servers due to the
 scraping.

Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
downloading of the repos would probably reduce that demand to
insignificance, especially the first time around when most of the
repos would need to be downloaded.
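
Spelled out, the estimate above looks like the sketch below. Reading the
"20" as the number of repos listed per page (one extra request per repo)
is an assumption; the message itself only gives the formula.

```haskell
-- Listing pages to fetch, as stated above.
listingPageCount :: Int
listingPageCount = 98

-- ASSUMPTION: the "20" in the formula is the number of repos per page.
reposPerPage :: Int
reposPerPage = 20

-- One request per listing page, plus one request per listed repo.
totalHits :: Int
totalHits = listingPageCount + reposPerPage * listingPageCount

main :: IO ()
main = print totalHits -- prints 2058
```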

 Usually, the github guys are helpful when you have a
 question.

Any suggested method besides the obvious http://github.com/contact ?

-- 
gwern


Re: [Haskell-cafe] Downloading Haskell repos from GitHub

2010-04-30 Thread Jesper Louis Andersen
On Fri, Apr 30, 2010 at 6:02 PM, Gwern Branwen gwe...@gmail.com wrote:

 Github has a REST API for accessing data. Unfortunately it can't give
 you the wanted breakdown, but I would ask them for it. It is much
 simpler for you,

 You mean ask for a new feature? (Just a one-time list is no good since
 I intend to repeat it regularly to pick up new repos, just like with
 patch-tag.)


Yes.

 Any suggested method besides the obvious http://github.com/contact ?

No.


-- 
J.