I hope to make this 'State of the Archiva' a fairly regular report
here on the [email protected] mailing list.
Here, the Past, Present, and Future of Archiva will be outlined.
:: PAST ::
Archiva was started by Brett Porter back in Nov 2005 as a webapp
designed to replace the aging maven-proxy product.
It was called Maven Repository Manager back then, MRM for short (you
can even see that the JIRA tracker still uses this moniker).
It has undergone a name change and organic growth since then.
:: PRESENT ::
The organic growth in Archiva reached a point where people could not
reliably run Archiva for any significant period of time, or use it
with large repository collections.
Jira MRM-239 was born.
There have been several profiling efforts against the archiva/trunk
codebase using the YourKit Java Profiler donated by YourKit to the
Maven PMC.
These identified that the memory and stability issues stemmed from
two major areas of the codebase.
1) Repository Discovery.
The implementation inside Archiva maintained an in-memory list
of all artifact filenames present in the repository.
This list could be as small as 300 KB or as large as 42 MB (in the
case of ibiblio). It spiked memory usage to such a degree that
the garbage collector couldn't respond fast enough, often leaving
the JVM no option but to throw an OutOfMemoryError.
2) Reporting Database.
The reporting database was kept in an XML file on disk. It too
was held entirely in memory, producing similar memory pressure
on the JVM.
Fixing these two problems required near-complete overhauls of those
components.
This work is being performed in the ...
https://svn.apache.org/repos/asf/maven/archiva/branches/archiva-MRM-239
... branch.
It is ready for testing.
Highlights of this branch...
*) Discovery is now done in a publish / subscribe model.
   A Consumer concept has been created: a predefined list of
   consumers is handed to the discovery process, and for each
   included hit against the repository, an Artifact,
   RepositoryMetadata, or MavenProject object is handed to each
   Consumer for processing.
   We have consumers for ArtifactHealth, MetadataHealth, and
   ArtifactIndexing at the moment (see the sketch after this list).
*) Reporting has been completely gutted and is currently using
   JPOX (on a temporary basis).
*) There is no longer an IndexerTask; it has been replaced by a
   much simpler and more flexible DataRefreshTask.
*) A lot of the functionality in maven-core was moved into a new
   maven-common module that fits better into the multimodule tree.
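To make the Consumer concept concrete, here is a rough sketch of
what such an interface could look like. This is my own illustration,
not necessarily what the MRM-239 branch ends up with.

    import org.apache.maven.artifact.Artifact;
    import org.apache.maven.artifact.repository.metadata.RepositoryMetadata;
    import org.apache.maven.project.MavenProject;

    // Hypothetical sketch of a discovery Consumer.
    public interface Consumer
    {
        /* Called for each artifact the discovery walk includes. */
        void processArtifact( Artifact artifact );

        /* Called for each repository metadata file encountered. */
        void processMetadata( RepositoryMetadata metadata );

        /* Called for each resolvable project model (POM). */
        void processProject( MavenProject project );
    }

Discovery then walks the repository once and hands each hit to every
registered consumer in turn.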
:: FUTURE ::
Brett and I want to get a stable archiva alpha release cut as soon
as possible.
:: Architecture ::
In the process of overhauling archiva for the memory issues in
MRM-239, it has become apparent that the architecture needs some
realigning.
This is what I propose (and I will do the work).
Directory Structure:
archiva/
    archiva-api/
    archiva-core/
    archiva-repository-api/
    archiva-repository-proxy-connector/
    archiva-repository-sync-connector/
    archiva-repository-migration-connector/
    archiva-configuration/
    archiva-consumer-api/
    archiva-reporting/
        archiva-report-manager/
        archiva-reports/
            archiva-rss-report/
            archiva-charts-report/
            archiva-health-report/
            archiva-trends-report/
            archiva-stats-report/
    archiva-dao/
    archiva-workflow/
    archiva-cli/
    archiva-web/
        archiva-ws-service/
        archiva-applet/
        archiva-security/
        archiva-webapp/
    archiva-standalone/
        archiva-plexus-application/
        archiva-plexus-runtime/
Some details...
maven-core needs to be completely gutted and its current objects
refactored into the appropriate other modules.
archiva-api is intended to be the interface to archiva. Not sure
how successful this will be. Stay tuned. Comments are invited.
:: Repository Layer ::
The current maven-repository-layer is underutilized, and needs to
become the focus of repository interaction within archiva.
I want to set up the following classes ...
DefinedRepositories
    .getRemoteRepositories()
    .getManagedRepositories()
ManagedRepository
RemoteRepository
RepositoryConnector
    (Examples of Repository Connector)
    RepositoryProxyConnector
    RepositorySyncConnector
    RepositoryMigrateConnector
With these objects, we should be able to manage, present, store,
and reference the repositories with maximum flexibility.
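A rough sketch of how these might hang together (hypothetical names
and signatures, for illustration only; each interface would live in
its own file):

    import java.util.List;

    // ManagedRepository and RemoteRepository are the proposed
    // classes listed above.
    public interface DefinedRepositories
    {
        List<RemoteRepository> getRemoteRepositories();
        List<ManagedRepository> getManagedRepositories();
    }

    public interface RepositoryConnector
    {
        ManagedRepository getSourceRepository();
        RemoteRepository getTargetRepository();
    }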
Imagine an admin screen that shows the following ...
+---------+                             +----------------------------+
| Managed |                             | Remote                     |
| central |------[ Proxy Connector ]--->| http://ibiblio.org/maven2/ |
+---------+   |                         +----------------------------+
              |
              |                         +----------------------------+
              +--[ Proxy Connector ]--->| Remote                     |
              |                         | http://repo1.maven.org/... |
              |                         +----------------------------+
              |
              |                         +----------------------------+
              +<--[ Sync Connector ]----| Remote                     |
                                        | http://www.jpox.org/maven2 |
                                        +----------------------------+
Being able to conceptually glue together managed and remote
repositories is very powerful.
A migration of a repository from maven 1 to maven 2 could even be
presented as a Migration Connector.
[Maven 1 Repo]---->[Migrate Connector]---->[Maven 2 Repo]
Connectors should also be the home for any whitelist and blacklist
concepts (for example: against groupIds, or licenses), as in the
sketch below.
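Purely as illustration (the names are made up), a proxy connector
might carry those policies like this:

    import java.util.List;

    // Entries might be path patterns, groupIds, or licenses.
    public interface RepositoryProxyConnector extends RepositoryConnector
    {
        List<String> getWhitelist();
        List<String> getBlacklist();
    }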
:: DAO Layer ::
We need to rethink our over-reliance on the Lucene index for all
things database-oriented.
I feel that the Lucene index is tremendously important, and needs to
remain, and even be enhanced.
Some requirements...
* Indexing Component / API: allow tool writers to utilize the
  Lucene indexing infrastructure in their own tools for local
  repository indexing.
* Indexing WS/REST Interface: allow use of an installed archiva
  index via a WS/REST interface; IDEs could use this (though that
  might not be technically feasible)
* Downloadable Indexes: IDE integration would utilize this
functionality to provide speedy resolution of classes /
artifacts / projects / etc...
I feel that the Lucene index should contain not only a list of all
content present in the managed repositories, but also be able to
track artifact content from artifacts submitted to archiva for
indexing only.
(Think of the javax.* artifact details: they can't be allowed to be
downloaded if apache.org is hosting the archiva instance, but that
information is still valuable to have within the index.)
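As a sketch of the indexing component idea (hypothetical names;
nothing here is final):

    import java.util.List;

    import org.apache.maven.artifact.Artifact;

    // Illustrative sketch of an indexing component API.
    public interface RepositoryIndex
    {
        /* Add or update an artifact's details in the Lucene index. */
        void indexArtifact( Artifact artifact );

        /* Record an artifact's details as index-only: searchable,
           but not downloadable from this instance. */
        void indexDetailsOnly( Artifact artifact );

        /* Query the index, returning the matching artifacts. */
        List<Artifact> search( String query );
    }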
The DAO Layer should also provide a database access layer for
objects and reports.
I want to utilize an ORM technology, but that technology should allow
for mapping existing objects WITHOUT enhancing them (JPOX/JDO is
out).
I'll be studying JPA and iBatis to see which will provide the
unenhanced object model approach we need.
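For instance, JPA can map a plain object using annotations alone,
with no bytecode enhancement step. A minimal sketch (the class and
fields are made up for illustration):

    import javax.persistence.Entity;
    import javax.persistence.Id;

    // Made-up example: a plain object mapped with JPA annotations,
    // requiring no enhancement of the compiled class.
    @Entity
    public class ArtifactRecord
    {
        @Id
        private String id;       // e.g. groupId:artifactId:version

        private long size;       // file size in bytes
        private String checksum; // sha1 of the artifact file

        // getters and setters omitted for brevity
    }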
The webapp's groupId and artifact browsing will place demands on
the database via the DAO, but not the search facility; that will be
handled by Lucene.
:: Reporting Infrastructure ::
This is sorely needed. I do not yet have a good architecture in
mind, but let me share with you some of my criteria.
The reporting layer will use the archiva-dao layer.
Static, historical, and dynamic information must all be handled.
Static content examples: statistics, state, configuration.
Historical content examples: data refresh run performance, daily
audit logs, growth of a groupId.
Dynamic information examples: object information, artifact
information, artifact health, metadata health.
Some concepts I would like to see in the reporting API (a code
sketch follows the list):
* ReportData - reference to the data.
.query()
.add()
.remove()
.update()
* HistoricalReportData / StaticReportData / DynamicReportData
* ReportViews
* ReportViewTable
* ReportViewHistoricalChart
* ReportViewHistoricalSparkline
* ReportViewHistoricalRSS
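Sketching the ReportData idea in code (entirely illustrative; the
query signature in particular is invented):

    import java.util.List;

    // ReportData is a reference to the underlying data,
    // not the rendered report.
    public interface ReportData
    {
        List<Object> query( String criteria );
        void add( Object record );
        void remove( Object record );
        void update( Object record );
    }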
Some reports, static in nature, should be generated as part of the
periodic DataRefreshTask.
:: WebApp ::
The structure of the webapp will largely remain intact.
I have been encouraged to move away from XWork / WebWork to Struts 2,
as those projects have merged.
:: Workflow ::
This concept is not yet present in archiva.
Its intent is to provide a consistent means of managing the
repositories from within the archiva webapp interface; a rough
sketch follows the examples below.
Some early workflow examples:
1) Artifact Submission. (webform to submit artifact bundles)
2) Artifact Approval. (proxied content must pass approval)
3) Artifact Promotion. (sandbox to internal repos, for example)
4) Snapshot Cleanup. (remove old snapshots)
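To make the idea concrete, a single workflow step might look
something like this. Everything here is hypothetical; WorkflowContext
and WorkflowException are invented for the sketch.

    // Hypothetical sketch of a single workflow action.
    public interface WorkflowAction
    {
        /* Human-readable name, e.g. "Artifact Approval". */
        String getName();

        /* Whether the given user may trigger this action. */
        boolean isPermitted( String principal );

        /* Carry out the action, e.g. promote an artifact from a
           sandbox repository to an internal one. */
        void execute( WorkflowContext context )
            throws WorkflowException;
    }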
I could use some help in this area: suggestions, techniques, etc...
:: Clustering ::
This is perceived as mostly an installation issue; we need to
document how to set up high-availability archiva instances that can
share load and/or fail over to another archiva instance.
:: Index Sharing ::
One or more archiva instances could share their indexes, providing
a web of information about as many artifacts as possible; actually
being able to download an artifact would still be governed by
archiva security.
:: Conclusion ::
Exciting things are coming for archiva.
Stay tuned.
Join in on the testing.
Provide Feedback.
File Bug Reports.
- Joakim Erdfelt