I hope to make this 'State of the Archiva' a fairly regular report
here on the [email protected] mailing list.
Here, the Past, Present, and Future of Archiva will be outlined.
:: PAST ::
Archiva was started by Brett Porter back in Nov 2005 as a webapp
designed to replace the aging maven-proxy product.
It was called Maven Repository Manager back then, MRM for short (you
can even see that the JIRA tracker still uses this moniker).
It has undergone a name change and organic growth since then.
:: PRESENT ::
The organic growth in Archiva reached a point where people could not
reliably run Archiva for any significant period of time, or use it
with large repository collections.
Jira MRM-239 was born.
There have been several profiling efforts against the archiva/trunk
codebase using the YourKit Java Profiler donated by YourKit to the
Maven PMC.
These identified that the memory and stability issues stemmed from
two major areas of the codebase.
1) Repository Discovery.
The implementation inside Archiva maintained an in-memory list
of all artifact filenames present in the repository.
This list could be as small as 300 KB or as large as 42 MB (in the
case of ibiblio). It spiked memory usage to such a degree that
the garbage collector couldn't respond fast enough, often leaving
the JVM no option but to throw an OutOfMemoryError.
2) Reporting Database.
The reporting database was kept in an XML file on disk. It too
was held entirely in memory, producing similar memory pressure
on the JVM.
Fixing these two problems required near-complete overhauls of those
components.
This work is being performed in the ...
https://svn.apache.org/repos/asf/maven/archiva/branches/archiva-MRM-239
... branch.
It is ready for testing.
Highlights of this branch...
*) Discovery is now done in a publish / subscribe model.
   A Consumer concept has been created: a predefined list of
   consumers is handed to the discovery process, and for each
   included hit against the repository, an Artifact,
   RepositoryMetadata, or MavenProject object is handed to each
   Consumer for processing.
   We have consumers for ArtifactHealth, MetadataHealth, and
   ArtifactIndexing at the moment (see the sketch after this list).
*) Reporting has been completely gutted and is currently using
   JPOX (on a temporary basis).
*) There is no longer an IndexerTask; it has been replaced by a
   much simpler and more flexible DataRefreshTask.
*) A lot of the functionality in maven-core was moved into a new
   maven-common module that fits better into the multimodule tree.
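To make the Consumer concept concrete, here is a rough sketch of
what such an interface could look like. This is my own illustration,
not necessarily what the MRM-239 branch ends up with.

    import org.apache.maven.artifact.Artifact;
    import org.apache.maven.artifact.repository.metadata.RepositoryMetadata;
    import org.apache.maven.project.MavenProject;

    // Hypothetical sketch of a discovery Consumer.
    public interface Consumer
    {
        /* Called for each artifact the discovery walk includes. */
        void processArtifact( Artifact artifact );

        /* Called for each repository metadata file encountered. */
        void processMetadata( RepositoryMetadata metadata );

        /* Called for each resolvable project model (POM). */
        void processProject( MavenProject project );
    }

Discovery then walks the repository once and hands each hit to every
registered consumer in turn.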
:: FUTURE ::
Brett and I want to get a stable archiva alpha release cut as soon
as possible.
:: Architecture ::
In the process of overhauling archiva for the memory issues in
MRM-239, it has become apparent that the architecture needs some
realigning.
This is what I propose (and I will do the work).
Directory Structure:
archiva/
    archiva-api/
    archiva-core/
    archiva-repository-api/
    archiva-repository-proxy-connector/
    archiva-repository-sync-connector/
    archiva-repository-migration-connector/
    archiva-configuration/
    archiva-consumer-api/
    archiva-reporting/
        archiva-report-manager/
        archiva-reports/
            archiva-rss-report/
            archiva-charts-report/
            archiva-health-report/
            archiva-trends-report/
            archiva-stats-report/
    archiva-dao/
    archiva-workflow/
    archiva-cli/
    archiva-web/
        archiva-ws-service/
        archiva-applet/
        archiva-security/
        archiva-webapp/
    archiva-standalone/
        archiva-plexus-application/
        archiva-plexus-runtime/
Some details...
maven-core needs to be completely gutted and its current objects
refactored into the appropriate other modules.
archiva-api is intended to be the interface to archiva. Not sure
how successful this will be. Stay tuned. Comments are invited.
:: Repository Layer ::
The current maven-repository-layer is underutilized, and needs to
become the focus of repository interaction within archiva.
I want to set up the following classes ...
DefinedRepositories
    .getRemoteRepositories()
    .getManagedRepositories()
ManagedRepository
RemoteRepository
RepositoryConnector
    (Examples of Repository Connector)
    RepositoryProxyConnector
    RepositorySyncConnector
    RepositoryMigrateConnector
With these objects, we should be able to manage, present, store,
and reference the repositories with maximum flexibility.
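A rough sketch of how these might hang together (hypothetical names
and signatures, for illustration only; each interface would live in
its own file):

    import java.util.List;

    // ManagedRepository and RemoteRepository are the proposed
    // classes listed above.
    public interface DefinedRepositories
    {
        List<RemoteRepository> getRemoteRepositories();
        List<ManagedRepository> getManagedRepositories();
    }

    public interface RepositoryConnector
    {
        ManagedRepository getSourceRepository();
        RemoteRepository getTargetRepository();
    }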
Imagine an admin screen that shows the following ...
+---------+                             +----------------------------+
| Managed |                             | Remote                     |
| central |------[ Proxy Connector ]--->| http://ibiblio.org/maven2/ |
+---------+   |                         +----------------------------+
              |
              |                         +----------------------------+
              +--[ Proxy Connector ]--->| Remote                     |
              |                         | http://repo1.maven.org/... |
              |                         +----------------------------+
              |
              |                         +----------------------------+
              +<--[ Sync Connector ]----| Remote                     |
                                        | http://www.jpox.org/maven2 |
                                        +----------------------------+
Being able to conceptually glue together managed and remote
repositories is very powerful.
A migration of a repository from maven 1 to maven 2 could even be
presented as a Migration Connector.
[Maven 1 Repo]---->[Migrate Connector]---->[Maven 2 Repo]
Connectors should also be the home for any whitelist and blacklist
concepts (for example: against groupIds, or licenses), as in the
sketch below.
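Purely as illustration (the names are made up), a proxy connector
might carry those policies like this:

    import java.util.List;

    // Entries might be path patterns, groupIds, or licenses.
    public interface RepositoryProxyConnector extends RepositoryConnector
    {
        List<String> getWhitelist();
        List<String> getBlacklist();
    }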
:: DAO Layer ::
We need to rethink our over-reliance on the Lucene index for all
things database-oriented.
I feel that the Lucene index is tremendously important, and needs to
remain, and even be enhanced.
Some requirements...
* Indexing Component / API: allow tool writers to utilize the
  Lucene indexing infrastructure in their own tools for local
  repository indexing.
* Indexing WS/REST Interface: allow use of an installed archiva
  index via a WS/REST interface; IDEs could use this (though that
  might not be technically feasible)
* Downloadable Indexes: IDE integration would utilize this
functionality to provide speedy resolution of classes /
artifacts / projects / etc...
I feel that the Lucene index should contain not only a list of all
content present in the managed repositories, but also be able to
track artifact content from artifacts submitted to archiva for
indexing only.
(Think of the javax.* artifact details: they can't be allowed to be
downloaded if apache.org is hosting the archiva instance, but that
information is still valuable to have within the index.)
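As a sketch of the indexing component idea (hypothetical names;
nothing here is final):

    import java.util.List;

    import org.apache.maven.artifact.Artifact;

    // Illustrative sketch of an indexing component API.
    public interface RepositoryIndex
    {
        /* Add or update an artifact's details in the Lucene index. */
        void indexArtifact( Artifact artifact );

        /* Record an artifact's details as index-only: searchable,
           but not downloadable from this instance. */
        void indexDetailsOnly( Artifact artifact );

        /* Query the index, returning the matching artifacts. */
        List<Artifact> search( String query );
    }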
The DAO Layer should also provide a database access layer for
objects and reports.
I want to utilize an ORM technology, but that technology should allow
for mapping existing objects WITHOUT enhancing them (JPOX/JDO is
out).
I'll be studying JPA and iBatis to see which will provide the
unenhanced object model approach we need.
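For instance, JPA can map a plain object using annotations alone,
with no bytecode enhancement step. A minimal sketch (the class and
fields are made up for illustration):

    import javax.persistence.Entity;
    import javax.persistence.Id;

    // Made-up example: a plain object mapped with JPA annotations,
    // requiring no enhancement of the compiled class.
    @Entity
    public class ArtifactRecord
    {
        @Id
        private String id;       // e.g. groupId:artifactId:version

        private long size;       // file size in bytes
        private String checksum; // sha1 of the artifact file

        // getters and setters omitted for brevity
    }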
The webapp's groupId and artifact browsing will place demands on
the database via the DAO, but not the search facility; that will be
handled by Lucene.
:: Reporting Infrastructure ::
This is sorely needed. I do not yet have a good architecture in
mind, but let me share with you some of my criteria.
The reporting layer will use the archiva-dao layer.
Static, historical, and dynamic information must all be handled.
Static content examples: statistics, state, configuration.
Historical content examples: data refresh run performance, daily
audit logs, growth of a groupId.
Dynamic information examples: object information, artifact
information, artifact health, metadata health.
Some concepts I would like to see in the reporting API (a code
sketch follows the list):
* ReportData - reference to the data.
.query()
.add()
.remove()
.update()
* HistoricalReportData / StaticReportData / DynamicReportData
* ReportViews
* ReportViewTable
* ReportViewHistoricalChart
* ReportViewHistoricalSparkline
* ReportViewHistoricalRSS
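Sketching the ReportData idea in code (entirely illustrative; the
query signature in particular is invented):

    import java.util.List;

    // ReportData is a reference to the underlying data,
    // not the rendered report.
    public interface ReportData
    {
        List<Object> query( String criteria );
        void add( Object record );
        void remove( Object record );
        void update( Object record );
    }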
Some reports, static in nature, should be generated as part of the
periodic DataRefreshTask.
:: WebApp ::
The structure of the webapp will largely remain intact.
I have been encouraged to move away from XWork / WebWork to Struts 2,
as those projects have merged.
:: Workflow ::
This concept is not yet present in archiva.
Its intent is to provide a consistent means of managing the
repositories from within the archiva webapp interface; a rough
sketch follows the examples below.
Some early workflow examples:
1) Artifact Submission. (webform to submit artifact bundles)
2) Artifact Approval. (proxied content must pass approval)
3) Artifact Promotion. (sandbox to internal repos, for example)
4) Snapshot Cleanup. (remove old snapshots)
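To make the idea concrete, a single workflow step might look
something like this. Everything here is hypothetical; WorkflowContext
and WorkflowException are invented for the sketch.

    // Hypothetical sketch of a single workflow action.
    public interface WorkflowAction
    {
        /* Human-readable name, e.g. "Artifact Approval". */
        String getName();

        /* Whether the given user may trigger this action. */
        boolean isPermitted( String principal );

        /* Carry out the action, e.g. promote an artifact from a
           sandbox repository to an internal one. */
        void execute( WorkflowContext context )
            throws WorkflowException;
    }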
I could use some help in this area: suggestions, techniques, etc...
:: Clustering ::
This is perceived as mostly an installation issue; we need to
document how to set up high-availability archiva instances that can
share load and/or fail over to another archiva instance.
:: Index Sharing ::
One or more archiva instances could share their indexes, providing
a web of information about as many artifacts as possible; actually
being able to download an artifact would still be governed by
archiva security.
:: Conclusion ::
Exciting things are coming for archiva.
Stay tuned.
Join in on the testing.
Provide Feedback.
File Bug Reports.
- Joakim Erdfelt