Re: How to get access to ALL the data in maven central?

2012-04-10 Thread Matt Taylor
Answered my own question to a degree.  For the benefit of the group here is
how to do it:

rsync -a -v --include '*/' --include '*.pom' --include '*.xml' --exclude '*' \
  --bwlimit=1000 mirrors.ibiblio.org::maven2/ maven2

That will retrieve all of the pom and xml metadata files for the maven
central repository.
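
To see why the order of the include/exclude options matters, here is a minimal Python sketch of first-match-wins filtering. It is a simplified model for illustration, not rsync's actual implementation: it only handles slash-free patterns matched against the last path component, and the name `is_transferred` and the sample paths are made up for this example.

```python
from fnmatch import fnmatchcase

# The rules from the rsync command above, in the same order.
# Simplified model: the first rule whose pattern matches decides the outcome.
RULES = [
    ("include", "*/"),     # every directory, so the walk can descend into it
    ("include", "*.pom"),
    ("include", "*.xml"),
    ("exclude", "*"),      # everything that no earlier rule matched
]


def is_transferred(path, is_dir=False, rules=RULES):
    """Return True if this simplified model would transfer `path`."""
    name = path.rstrip("/").split("/")[-1]
    for action, pattern in rules:
        if pattern.endswith("/"):
            # A trailing slash restricts the pattern to directories.
            if not is_dir:
                continue
            pattern = pattern[:-1]
        if fnmatchcase(name, pattern):
            return action == "include"
    return True  # nothing matched: transferred by default


if __name__ == "__main__":
    samples = [
        ("junit/junit/4.10", True),
        ("junit/junit/4.10/junit-4.10.pom", False),
        ("junit/junit/maven-metadata.xml", False),
        ("junit/junit/4.10/junit-4.10.jar", False),
    ]
    for path, is_dir in samples:
        print(path, is_transferred(path, is_dir))
```

Dropping the `*/` rule in this model makes every directory fall through to the final `*` exclude, which is why the directory include has to come first.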

At first I tried to just do a full rsync, but ibiblio cut me off after
about 3.4G of transfer.  After an hour or so they let me back in, hence the
bwlimit of 1000 KB/s to try not to hog their bandwidth.  Unfortunately
they don't seem to publish what their limits are, so I guess I'll have to
play with it to see how long it takes me to get all the data.

After I get all the poms I'll start in on the full repository via a slow
slurp.  I'm OK with it taking weeks to get the jars for the first sync, and
then once I have the full repo getting the updates shouldn't be so taxing.

Progress!

Matt





Re: How to get access to ALL the data in maven central?

2012-04-10 Thread Barrie Treloar
On Tue, Apr 10, 2012 at 3:42 PM, Matt Taylor
m...@matthewjosephtaylor.com wrote:
 After I get all the poms I'll start in on the full repository via a slow
 slurp.  I'm OK with it taking weeks to get the jars for the first sync, and
 then once I have the full repo getting the updates shouldn't be so taxing.

You don't want to get the jar files.

They aren't going to tell you anything.

-
To unsubscribe, e-mail: users-unsubscr...@maven.apache.org
For additional commands, e-mail: users-h...@maven.apache.org



Re: How to get access to ALL the data in maven central?

2012-04-10 Thread Matt Taylor
Actually I think the jars are going to tell me quite a bit.  By looking
into the class files I should be able to link not only which dependencies
are being used by which projects, but also which methods/classes are being
used within each dependency.  I can then, for instance, create a 'heat map'
for each project to show what classes/methods are most used within that
project.
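
A possible first step for that kind of analysis, sketched in Python: a jar is a zip archive, so the classes it defines can be listed with the standard `zipfile` module. This only covers which classes a jar provides. Finding which classes and methods a jar references would need a bytecode parser, which is not shown here. The function name `classes_in_jar` is made up for this sketch.

```python
import zipfile


def classes_in_jar(jar):
    """Return the fully qualified names of the classes stored in a jar.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            # "org/example/Foo.class" -> "org.example.Foo"
            name[: -len(".class")].replace("/", ".")
            for name in archive.namelist()
            if name.endswith(".class")
        )
```

Inner classes show up under their compiled names (for example `Outer$Inner`), so a real analysis would have to decide whether to fold them into the outer class.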


Matt





Re: How to get access to ALL the data in maven central?

2012-04-10 Thread Wayne Fay
 If you wanted to scrape Maven Central for just the poms then I'd
 contact Sonatype who manage the central repository.

As Barrie said, you could talk to Sonatype (Brian specifically) since
they operate the Maven Central repo and they might be able to make
available an archive of all the pom files (no artifacts) in Central. I
know you have a solution with rsync but this might save some time.

Alternatively you could run your own local Repo Manager (Archiva,
Artifactory, Nexus) which would cache the artifacts and poms requested
through it. The Aether API might be a useful thing to look at as well.
You may be able to tell the API to pull down just the pom file and not
the jar, at least for the first pass, then decide if you want the jars
as well for a second pass later.

Wayne




Re: How to get access to ALL the data in maven central?

2012-04-10 Thread Brian Fox
Make a request here and I can attach the poms for you:
https://issues.sonatype.org/browse/MVNCENTRAL





How to get access to ALL the data in maven central?

2012-04-09 Thread Matt Taylor
Perhaps this is already in existence somewhere.  If so please point me in
the right direction.

I want to know what the most popular dependencies are, not based on
downloads, but based on dependencies from other projects.
I want to explore the full dependency graph and see its evolution over
'time' (for instance seeing how fast versions of artifacts are adopted).
I want to create a visual representation of all the dependencies just
because it would look cool.
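
As a rough illustration of the "popularity by dependents" idea, the Python sketch below counts how many POM documents declare each `groupId:artifactId` as a direct dependency. It is a minimal sketch under several assumptions: the POMs use the standard POM 4.0.0 XML namespace, and property placeholders, parent inheritance and `dependencyManagement` sections are ignored. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by POM 4.0.0 files; POMs without it are not matched here.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def declared_dependencies(pom_xml):
    """Yield 'groupId:artifactId' for each direct dependency in one POM."""
    root = ET.fromstring(pom_xml)
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        if group and artifact:
            yield f"{group}:{artifact}"


def count_dependents(pom_documents):
    """Count, per dependency, how many of the given POMs declare it."""
    counts = Counter()
    for pom_xml in pom_documents:
        # set() so a POM listing the same dependency twice counts once.
        counts.update(set(declared_dependencies(pom_xml)))
    return counts
```

Calling `count_dependents(poms).most_common(20)` would then give a first ranking. Since every released version has its own POM, counting per POM weights projects with many releases more heavily; grouping by project first is one way to correct for that.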

In general I want total access to all the metadata (pom files essentially)
in the maven central repo, so I can see how the world's software fits
together on a 'global' scale.

Eventually I would like to explore the jar artifacts as well to get deeper
insights into what methods/classes are being referenced, but that
is phase 2. :)

From googling around it appears that, understandably, it is improper to
simply wget the entire repo.  However, there don't seem to be any publicly
available torrents or other resources for me to get access to this data.

http://search.maven.org/#stats

457GB is a lot of data, but it isn't an unimaginable amount, and most of
that is no doubt the artifacts, not the metadata (pom files).

So I really have two questions:

1. What is the easiest path to getting rsync-type access to the full repo
(I'd quite understand if I needed to pay a fee for this level of access)?
2. Failing that, what would be a legitimate way of just getting all the pom
files?

Basically I want to be a good guy and not put undue load on the servers, but
at the same time I really want the data.

Thanks,

Matt Taylor
http://blog.matthewjosephtaylor.com


Re: How to get access to ALL the data in maven central?

2012-04-09 Thread Ron Wheeler
You are going to be missing the key ingredient which is the application 
POMs that tell you what artifacts are actually used.


You might get some interesting information about things like log4j, which
is probably used by lots of things inside Maven Central.
You will be grossly misled about the use of things like CXF, since it is
hardly ever called by a library that would be submitted to Maven Central
but is frequently used by projects that are in private repositories.


You may be able to visualize a "where used" relationship between libraries,
but you will have a lot of nodes that appear never to be used, which is not
true.


You will have to figure out a way to separate projects that are still 
used and produced a ton of revisions 5 years ago but nothing since, from 
projects that are mature yet still active but only produce new versions 
every 18 months since they are stable and work, from projects that were 
very active and then died as they became unnecessary due to newer 
technologies being introduced.


You will also have trouble with projects that repackage their artifacts 
between major releases and change the GAV structure by redistributing 
the functionality.


Not sure that your project is going to produce any useful information 
and I fear that it will be misleading to anyone who does not look deeper 
into the raw data.


Visualization may just make it easier for incorrect conclusions to be 
developed.


Ron





--
Ron Wheeler
President
Artifact Software Inc
email: rwhee...@artifact-software.com
skype: ronaldmwheeler
phone: 866-970-2435, ext 102




Re: How to get access to ALL the data in maven central?

2012-04-09 Thread Barrie Treloar
On Tue, Apr 10, 2012 at 12:31 PM, Ron Wheeler
rwhee...@artifact-software.com wrote:
[del]
 457GB is a lot of data, but it isn't an unimaginable amount, and most of
 that is no doubt the artifacts, not the metadata (pom files).
[del]

Assuming that you listened to Ron's reasoning but are going to go
ahead anyway:
457GB would be the jar sizes.
The poms themselves wouldn't be that big.

Maven Central isn't directly web browsable any more, but you could use
the mirror at http://mirrors.ibiblio.org/pub/mirrors/maven2/
If you wanted to scrape Maven Central for just the poms then I'd
contact Sonatype, who manage the central repository.




Re: How to get access to ALL the data in maven central?

2012-04-09 Thread Matt Taylor
I agree it is definitely going to be imperfect and it will in the end only
be a sampling of the real usage, but I think that it will still provide
interesting information.  As for bogus conclusions reached by others:
I plan on putting some effort into explaining what the results are, what
they mean, and making them accessible. Hopefully I'll get it mostly right
and/or attract other smarter people who will carry on from me.  Time will
tell on that one. :)

I agree that figuring out the temporal aspects of the graph will be a hard
problem (but rewarding as well if I can tease out the evolution of the
ecosystem).  Version numbers provide a sort of ordering, but it's messy.
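
To illustrate why ordering by version number is messy, here is a small Python sketch of a naive sort key. It is not Maven's own version ordering, which has more involved rules for qualifiers; it only shows that plain string sorting gets numeric parts wrong and that comparing them as numbers helps. The name `naive_version_key` is made up for this example.

```python
import re


def naive_version_key(version):
    """Split on '.' and '-' and compare all-digit parts as numbers.

    Each part becomes (is_number, number, text) so that numbers and
    text never get compared with each other directly.
    """
    key = []
    for token in re.split(r"[.-]", version):
        if re.fullmatch(r"[0-9]+", token):
            key.append((1, int(token), ""))
        else:
            key.append((0, 0, token.lower()))
    return key


if __name__ == "__main__":
    versions = ["1.10", "1.2", "1.9"]
    print(sorted(versions))                         # plain string order
    print(sorted(versions, key=naive_version_key))  # numeric-aware order
```

A known gap in this sketch: it sorts `1.0-beta` after `1.0`, whereas a pre-release would normally be expected to come first. Release timestamps, where available, may be a more reliable ordering than version strings.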

All in all I think you make some valid points as far as the difficulty, but
the challenges are part of what attracts me to this.  Even if I
fail miserably, I'll still learn a ton, and hopefully have some fun along
the way.

Matt




Re: How to get access to ALL the data in maven central?

2012-04-09 Thread Matt Taylor
Lol, Ron has valid points but I am indeed going forward (and have only
myself to blame).

Agreed, I just need the pom files, which are much smaller, but it is still
lots of hits on the web server.  If anyone knows of a 'nice' way of getting
just the pom files that would be good enough for the moment.  As a last
resort I suppose I could write something that slowly slurps them
down over time.  Does anyone have any idea what a 'reasonable' rate of doing
lots of small GETs on the repo would be?
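
A "slow slurp" of this kind could be sketched in Python as a loop that pauses between requests. The two-second default delay is an arbitrary placeholder, not a rate that any repository operator has said is acceptable, and the names `slow_fetch` and `fetch_url` are made up for this example.

```python
import time
import urllib.request


def fetch_url(url):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def slow_fetch(urls, delay_seconds=2.0, fetch=fetch_url, sleep=time.sleep):
    """Yield (url, body) pairs, pausing between consecutive requests.

    `fetch` and `sleep` are parameters so the loop can be tried out
    without touching the network or actually waiting.
    """
    for index, url in enumerate(urls):
        if index:
            # No pause before the first request, one before every later one.
            sleep(delay_seconds)
        yield url, fetch(url)
```

Because it is a generator, the caller can write each body to disk as it arrives and stop at any point, which makes it easy to resume later from a saved list of remaining URLs.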

Matt
