Re: [Haskell-cafe] Downloading Haskell repos from GitHub
On Fri, Apr 30, 2010 at 12:02 PM, Gwern Branwen wrote:
> On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
>> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>>> Nothing in http://develop.github.com/ seems especially useful for
>>> grabbing the git:// URLs of all repos by language - just by user.
>>>
>>> The only real list of repos by language seems to be gotten at via
>>> http://github.com/languages/Haskell/updated or
>>> http://github.com/languages/Haskell/created . (You might think
>>> http://github.com/languages/Haskell would be good, but no, it's just a
>>> few random repos by interest and not a full listing.)
>>
>> Github has a REST API for accessing data. Unfortunately it can't give
>> you the wanted breakdown, but I would ask them for it. It is much
>> simpler for you,
>
> You mean ask for a new feature? (Just a one-time list is no good since
> I intend to repeat it regularly to pick up new repos, just like with
> patch-tag.)
>
>> and it does not put an extra strain on their servers due to the
>> scraping.
>
> Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
> downloading of the repos would probably reduce that demand to
> insignificance, especially the first time around when most of the
> repos would need to be downloaded.
>
>> Usually, the github guys are helpful when you have a
>> question.

Ultimately, they never did anything about it:
http://support.github.com/discussions/email/6782-contact-extending-api-to-easily-get-list-of-repos-by-language

So I wrote a TagSoup scraper; then I wrote a long tutorial explaining
how I wrote it, step by step.

1. my tutorial: http://www.gwern.net/haskell/Archiving%20GitHub.html
2. the script itself: http://www.gwern.net/haskell/Archiving%20GitHub.html#the-script
3. Reddit submission of #1 for those who prefer to comment there:
   http://www.reddit.com/r/haskell/comments/g7na5/writing_a_haskell_script_to_download_github/

(While writing the tutorial, I tweaked the script code, so I'm not 100%
confident that it still works - it uses too much GitHub bandwidth (and
local disk space) for me to re-run it just to see whether it still
works. So if anyone does run it, I would appreciate knowing whether it
still works.)

-- 
gwern
http://www.gwern.net
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Downloading Haskell repos from GitHub
On Fri, Apr 30, 2010 at 6:02 PM, Gwern Branwen wrote:
>>
>> Github has a REST API for accessing data. Unfortunately it can't give
>> you the wanted breakdown, but I would ask them for it. It is much
>> simpler for you,
>
> You mean ask for a new feature? (Just a one-time list is no good since
> I intend to repeat it regularly to pick up new repos, just like with
> patch-tag.)
>

Yes.

> Any suggested method besides the obvious http://github.com/contact ?

No.

-- 
J.

Re: [Haskell-cafe] Downloading Haskell repos from GitHub
On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>> Nothing in http://develop.github.com/ seems especially useful for
>> grabbing the git:// URLs of all repos by language - just by user.
>>
>> The only real list of repos by language seems to be gotten at via
>> http://github.com/languages/Haskell/updated or
>> http://github.com/languages/Haskell/created . (You might think
>> http://github.com/languages/Haskell would be good, but no, it's just a
>> few random repos by interest and not a full listing.)
>
> Github has a REST API for accessing data. Unfortunately it can't give
> you the wanted breakdown, but I would ask them for it. It is much
> simpler for you,

You mean ask for a new feature? (Just a one-time list is no good since
I intend to repeat it regularly to pick up new repos, just like with
patch-tag.)

> and it does not put an extra strain on their servers due to the
> scraping.

Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The
downloading of the repos would probably reduce that demand to
insignificance, especially the first time around when most of the
repos would need to be downloaded.

> Usually, the github guys are helpful when you have a
> question.

Any suggested method besides the obvious http://github.com/contact ?

-- 
gwern

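The "about 2000 HTTP hits" figure can be checked with a few lines of Haskell. This is only a sketch of one reading of the estimate: it assumes 98 listing pages, 20 repos per page, and one extra request per repo page. Only the numbers 98 and 20 come from the message; the names below are made up for illustration.

```haskell
-- Assumed reading of "98 + (20 * 98)": one hit per listing page,
-- plus one hit for each of the 20 repos on every listing page.
listingPageCount, reposPerPage, totalHits :: Int
listingPageCount = 98
reposPerPage     = 20
totalHits        = listingPageCount + reposPerPage * listingPageCount
```

Under those assumptions `totalHits` comes to 2058, which matches the rough figure of 2000.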
Re: [Haskell-cafe] Downloading Haskell repos from GitHub
On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
> Nothing in http://develop.github.com/ seems especially useful for
> grabbing the git:// URLs of all repos by language - just by user.
>
> The only real list of repos by language seems to be gotten at via
> http://github.com/languages/Haskell/updated or
> http://github.com/languages/Haskell/created . (You might think
> http://github.com/languages/Haskell would be good, but no, it's just a
> few random repos by interest and not a full listing.)

Github has a REST API for accessing data. Unfortunately it can't give
you the wanted breakdown, but I would ask them for it. It is much
simpler for you, and it does not put an extra strain on their servers
due to the scraping. Usually, the github guys are helpful when you have
a question.

-- 
J.

[Haskell-cafe] Downloading Haskell repos from GitHub
Along the lines of
http://blog.patch-tag.com/2010/03/13/mirroring-patch-tag/ for
downloading all patch-tag.com repositories, I've begun to wonder how to
download all Github repositories since more and more people seem to be
using it.

Nothing in http://develop.github.com/ seems especially useful for
grabbing the git:// URLs of all repos by language - just by user.

The only real list of repos by language seems to be gotten at via
http://github.com/languages/Haskell/updated or
http://github.com/languages/Haskell/created . (You might think
http://github.com/languages/Haskell would be good, but no, it's just a
few random repos by interest and not a full listing.)

I looked at the HTML, and it looks possible to use tagsoup to get all
98 pages and then parse the entries to get the HTTP URLs of the repos,
and then turn *that* into git:// URLs suitable for shelling out to
'git clone', but I can't help but wonder if maybe there's a better
approach someone more familiar with Github would know.

-- 
gwern
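
The pure parts of the plan in this first message (enumerating the listing pages, and turning a repo's HTTP URL into a git:// URL) might look roughly like the sketch below. It is not the script from the tutorial. The `?page=N` query parameter, the `.git` suffix, and the names `listingPages`, `toGitUrl` and `splitOn` are all assumptions made for illustration. Fetching the pages and pulling the links out with tagsoup is left out so that the example needs only `base`.

```haskell
import Data.List (stripPrefix)

-- | URLs of the first n listing pages.
-- ASSUMPTION: the listing paginates via "?page=N"; the post does not say so.
listingPages :: Int -> [String]
listingPages n =
  [ "http://github.com/languages/Haskell/updated?page=" ++ show i
  | i <- [1 .. n] ]

-- | Turn an HTTP repo URL of the form http://github.com/<user>/<repo>
-- into a git:// URL. Any other shape yields Nothing.
-- ASSUMPTION: the clone URL is the same path with ".git" appended.
toGitUrl :: String -> Maybe String
toGitUrl url = do
  path <- stripPrefix "http://github.com/" url
  case splitOn '/' path of
    [user, repo]
      | not (null user), not (null repo) ->
          Just ("git://github.com/" ++ user ++ "/" ++ repo ++ ".git")
    _ -> Nothing

-- | Split a string on every occurrence of a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest
```

Each `Just` result could then be handed to `git clone`, for example through `System.Process`; URLs that do not have the expected two-component path are simply skipped.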