Re: How to get access to ALL the data in maven central?
Answered my own question to a degree. For the benefit of the group, here is how to do it:

    rsync -a -v --include '*/' --include '*.pom' --include '*.xml' --exclude '*' --bwlimit=1000 mirrors.ibiblio.org::maven2/ maven2

(The patterns are quoted so that the local shell does not expand them before rsync sees them.)

That will retrieve all of the pom and xml metadata files for the maven central repository.

At first I tried to just do a full rsync, but ibiblio cut me off after about 3.4G of transfer. After an hour or so they let me back in, hence the bwlimit of 1000KB/s to attempt to not hog their bandwidth. Unfortunately they don't seem to publish what their limits are, so I guess I'll have to play with it to see how long it takes me to get all the data.

After I get all the poms I'll start in on the full repository via a slow slurp. I'm OK with it taking weeks to get the jars for the first sync, and then once I have the full repo getting the updates shouldn't be so taxing.

Progress!

Matt

On Mon, Apr 9, 2012 at 9:20 PM, Matt Taylor m...@matthewjosephtaylor.com wrote:
[del]
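For a rough feel of how long a throttled full sync might take, here is a back-of-the-envelope sketch. It assumes the 457GB figure mentioned in this thread means 457 * 10^9 bytes for the whole repo, that --bwlimit=1000 is read by rsync in units of 1024 bytes per second (its documented default unit), and that the transfer runs at the cap the whole time without being cut off. None of those assumptions is confirmed by the mirror operators.

```python
def transfer_days(total_bytes: float, kib_per_second: float) -> float:
    """Days needed to move total_bytes at a sustained rate given in KiB/s."""
    seconds = total_bytes / (kib_per_second * 1024)
    return seconds / 86400  # 86400 seconds per day


# Assumption: "457GB" is taken as 457 * 10^9 bytes.
FULL_REPO_BYTES = 457e9

# Estimated days for the full repo at the 1000 KiB/s cap used above.
full_sync_days = transfer_days(FULL_REPO_BYTES, 1000)
```

Under these assumptions the full repo works out to a little over five days of continuous transfer, so "weeks" leaves plenty of room for lower limits or interruptions.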
Re: How to get access to ALL the data in maven central?
On Tue, Apr 10, 2012 at 3:42 PM, Matt Taylor m...@matthewjosephtaylor.com wrote:
[del]
After I get all the poms I'll start in on the full repository via a slow slurp. I'm OK with it taking weeks to get the jars for the first sync, and then once I have the full repo getting the updates shouldn't be so taxing.

You don't want to get the jar files. They aren't going to tell you anything.

- To unsubscribe, e-mail: users-unsubscr...@maven.apache.org
For additional commands, e-mail: users-h...@maven.apache.org
Re: How to get access to ALL the data in maven central?
Actually I think the jars are going to tell me quite a bit. By looking into the class files I should be able to create a link between not only what dependencies are being used by what projects, but what methods/classes are being used within each dependency as well. I can then, for instance, create a 'heat map' for each project to show what classes/methods are most used within that project.

Matt

On Tue, Apr 10, 2012 at 1:14 AM, Barrie Treloar baerr...@gmail.com wrote:
[del]
You don't want to get the jar files. They aren't going to tell you anything.
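As a first step toward looking inside jars, the sketch below lists the classes a jar defines. It relies only on the fact that a jar is a zip archive; the function name is made up for illustration. It does not find which methods of a dependency are referenced, since that would need a parser for the class file format, which is not shown here.

```python
import zipfile


def class_names(jar_file) -> list:
    """Return the fully qualified class names defined in a jar.

    jar_file may be a path or a binary file-like object.
    """
    suffix = ".class"
    with zipfile.ZipFile(jar_file) as jar:
        return sorted(
            # "com/example/Foo.class" becomes "com.example.Foo"
            name[: -len(suffix)].replace("/", ".")
            for name in jar.namelist()
            if name.endswith(suffix)
        )
```

Calling class_names("some-artifact.jar") on a local file (the file name is only a placeholder) gives the set of nodes such a heat map would be drawn over.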
Re: How to get access to ALL the data in maven central?
If you wanted to scrape Maven Central for just the poms then I'd contact Sonatype who manage the central repository.

As Barrie said, you could talk to Sonatype (Brian specifically), since they operate the Maven Central repo and might be able to make available a single archive (a zip or tarball) of all the pom files (no artifacts) in Central. I know you have a solution with rsync, but this might save some time.

Alternatively, you could run your own local Repo Manager (Archiva, Artifactory, Nexus), which would cache all the artifacts and poms.

The Aether API might be a useful thing to look at as well. You may be able to tell the API to pull down just the pom file and not the jar, at least for the first pass, then decide if you want the jars as well for a second pass later.

Wayne
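Fetching only the pom for a given artifact does not strictly need Aether, because the standard Maven repository layout makes the pom's location predictable from the coordinates. The sketch below builds that URL; it is not the Aether API, the function name is invented for illustration, and the coordinates in the usage note are examples only.

```python
def pom_url(base_url: str, group_id: str, artifact_id: str, version: str) -> str:
    """Build the pom URL for a groupId/artifactId/version under a repo root.

    Follows the standard Maven 2 layout:
    <group as path>/<artifactId>/<version>/<artifactId>-<version>.pom
    """
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}.pom"
    return f"{base_url.rstrip('/')}/{group_path}/{artifact_id}/{version}/{file_name}"
```

For example, pom_url("http://mirrors.ibiblio.org/pub/mirrors/maven2/", "log4j", "log4j", "1.2.16") points at the pom next to the log4j jar, so the jar itself never has to be requested. Whether a particular mirror still serves that path is not something this sketch can guarantee.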
Re: How to get access to ALL the data in maven central?
Make a request here and I can attach the poms for you: https://issues.sonatype.org/browse/MVNCENTRAL

On Tue, Apr 10, 2012 at 1:17 PM, Wayne Fay wayne...@gmail.com wrote:
[del]
How to get access to ALL the data in maven central?
Perhaps this is already in existence somewhere. If so, please point me in the right direction.

I want to know what the most popular dependencies are, not based on downloads, but based on dependencies from other projects. I want to explore the full dependency graph and see its evolution over 'time' (for instance, seeing how fast versions of artifacts are adopted). I want to create a visual representation of all the dependencies just because it would look cool.

In general I want total access to all the metadata (pom files essentially) in the maven central repo, so I can see how the world's software fits together on a 'global' scale. Eventually I would like to explore the jar artifacts as well to get deeper insights into what methods/classes are being referenced, but that is phase 2. :)

From googling around it appears that, understandably, it is improper to simply wget the entire repo. However, there don't seem to be any publicly available torrents, or other resources for me to get access to this data.

http://search.maven.org/#stats

457GB is a lot of data, but it isn't an unimaginable amount, and most of that is no doubt the artifacts, not the metadata (pom files).

So I really have two questions:

1. What is the easiest path to getting rsync-type access to the full repo? (I'd quite understand if I needed to pay a fee for this level of access.)
2. Failing that, what would be a legitimate way of just getting all the pom files?

Basically I want to be a good guy and not put undue load on the servers, but at the same time I really want the data.

Thanks,
Matt Taylor
http://blog.matthewjosephtaylor.com
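Ranking dependencies by how many other projects declare them could start from something like the sketch below. It is only a sketch: the function names are invented, it reads just the direct dependencies listed in each pom, and it ignores parent poms, dependencyManagement sections, profiles and property placeholders. It also assumes each pom declares the usual POM 4.0.0 XML namespace, so poms without it would yield nothing.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by POM 4.0.0 files; poms without it are not handled here.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def direct_dependencies(pom_xml) -> list:
    """Return (groupId, artifactId) pairs declared directly in one pom."""
    root = ET.fromstring(pom_xml)
    deps = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS)
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS)
        deps.append((group.strip(), artifact.strip()))
    return deps


def popularity(poms) -> Counter:
    """Count, for each dependency, how many poms declare it at least once."""
    counts = Counter()
    for pom_xml in poms:
        counts.update(set(direct_dependencies(pom_xml)))
    return counts
```

popularity(...).most_common(20) would then give a first, rough "most depended upon" list over whatever set of poms was fed in.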
Re: How to get access to ALL the data in maven central?
You are going to be missing the key ingredient, which is the application POMs that tell you what artifacts are actually used.

You might get some interesting information about things like log4j, which is probably used by lots of things inside Maven Central. You will be grossly misled about the use of things like CXF, since it is hardly ever called by a library that would be submitted to Maven Central but is frequently used by projects that are in private repositories.

You may be able to visualize a 'where used' relationship between libraries, but you will have a lot of nodes that appear never to be used, which is not true.

You will have to figure out a way to separate projects that are still used and produced a ton of revisions 5 years ago but nothing since, from projects that are mature yet still active but only produce new versions every 18 months since they are stable and work, from projects that were very active and then died as they became unnecessary due to newer technologies being introduced.

You will also have trouble with projects that repackage their artifacts between major releases and change the GAV structure by redistributing the functionality.

Not sure that your project is going to produce any useful information, and I fear that it will be misleading to anyone who does not look deeper into the raw data. Visualization may just make it easier for incorrect conclusions to be developed.

Ron

On 09/04/2012 10:20 PM, Matt Taylor wrote:
[del]

--
Ron Wheeler
President
Artifact Software Inc
email: rwhee...@artifact-software.com
skype: ronaldmwheeler
phone: 866-970-2435, ext 102
Re: How to get access to ALL the data in maven central?
On Tue, Apr 10, 2012 at 12:31 PM, Ron Wheeler rwhee...@artifact-software.com wrote:
[del]
457GB is a lot of data, but it isn't an unimaginable amount, and most of that is no doubt the artifacts, not the metadata (pom files).
[del]

Assuming that you listened to Ron's reasoning but are going to go ahead anyway: 457GB would be the jar sizes. The poms themselves wouldn't be that big.

Maven Central isn't directly web browsable any more, but you could use the mirror at http://mirrors.ibiblio.org/pub/mirrors/maven2/

If you wanted to scrape Maven Central for just the poms then I'd contact Sonatype, who manage the central repository.
Re: How to get access to ALL the data in maven central?
I agree it is definitely going to be imperfect, and it will in the end only be a sampling of the real usage, but I think that it will still provide interesting information.

As far as bogus conclusions reached by others: I plan on putting some effort into explaining what the results are and what they mean, and into making them accessible. Hopefully I'll get it mostly right and/or attract other smarter people who will carry on from me. Time will tell on that one. :)

I agree that figuring out the temporal aspects of the graph will be a hard problem (but rewarding as well if I can tease out the evolution of the ecosystem). Version numbers provide a sort of ordering, but it's messy.

All in all I think you make some valid points as far as the difficulty, but the challenges are part of what attracts me to this. Even if I fail miserably, I'll still learn a ton, and hopefully have some fun along the way.

Matt

On Mon, Apr 9, 2012 at 10:01 PM, Ron Wheeler rwhee...@artifact-software.com wrote:
[del]
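The remark that version numbers give a messy ordering can be made concrete with a small sketch. The key function below is a deliberately naive stand-in, not Maven's own version comparison: it splits on dots and hyphens and compares numeric parts as integers, which fixes the most obvious problem with plain string sorting.

```python
import re


def naive_version_key(version: str) -> tuple:
    """Sort key: numeric parts compare as integers, other parts as text.

    This is NOT Maven's ordering; qualifiers such as "beta" are not
    treated specially.
    """
    key = []
    for part in re.split(r"[.\-]", version):
        if part.isdecimal():
            key.append((1, int(part), ""))
        else:
            key.append((0, 0, part.lower()))
    return tuple(key)


versions = ["1.10", "1.9", "1.2"]
by_text = sorted(versions)                           # string comparison
by_number = sorted(versions, key=naive_version_key)  # numeric-aware comparison
```

Here by_text puts 1.10 before 1.2, while by_number gives 1.2, 1.9, 1.10. The messy part remains: this key would sort 1.0-beta after 1.0, whereas Maven treats such pre-release qualifiers as earlier than the release, so real adoption-over-time analysis would need Maven's actual comparison rules or release timestamps.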
Re: How to get access to ALL the data in maven central?
Lol, Ron has valid points, but I am indeed going forward (and have only myself to blame).

Agreed, I just need the pom files, which are much smaller, but it is still lots of hits on the web server. If anyone knows of a 'nice' way of getting just the pom files, that would be good enough for the moment.

As a last resort I suppose I could write something that attempts to slowly slurp them down over time. Does anyone have any idea what a 'reasonable' rate of doing lots of small GETs on the repo would be?

Matt

On Mon, Apr 9, 2012 at 11:08 PM, Barrie Treloar baerr...@gmail.com wrote:
[del]
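A slow slurp of many small GETs could be structured roughly as below. This is a sketch under assumptions: the function name and the one-second default delay are invented, and nothing in this thread establishes what rate the server operators would consider reasonable. The fetch function is passed in so the pacing logic can be tried without touching any server.

```python
import time


def fetch_slowly(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Yield (url, body) pairs, pausing delay_seconds between requests.

    fetch is any callable taking a URL and returning its content;
    sleep is injectable so the pacing can be tested without waiting.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        yield url, fetch(url)
```

With the standard library, fetch could be something like lambda u: urllib.request.urlopen(u).read(), and the delay can be raised if the server starts refusing connections.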