The format we currently use for storing source code is not optimal for
archival and analysis. Each mcz stores the complete source code of its
package version, so working across many versions is difficult and slow.
Today I experimented with an archive format that combines many mcz files
and should be able to reconstruct each individual one.
I defined an MCProject, representing a project repository. Its
definitions are stored in an OCLiteralSet:

Object subclass: #MCProject
    instanceVariableNames: 'location infos definitions repository'
    classVariableNames: ''
    package: 'MonticelloProjects'

For each filename found in the repository I load the MCVersion and its
snapshot.

MCProject>>read
    | filenames |
    repository := MCHttpRepository location: location user: '' password: ''.
    filenames := repository readableFileNames.
    filenames do: [ :each | self read: each ]

MCProject>>read: aFileName
    "Needs a rate limiter!!!"
    | mcVersion |
    mcVersion := repository loadNotCachedVersionFromFileNamed: aFileName.
    "Load the snapshot as well as the version"
    mcVersion snapshot.
    self parse: mcVersion.
    repository flushCache

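The rate limiter that the comment asks for could be as simple as a pause between requests. This is a minimal sketch, assuming a fixed delay is acceptable for the server. The file names and the 200 ms value are placeholders, and adding to a collection stands in for `self read: each`:

```smalltalk
| fileNames processed |
"Placeholder names; in MCProject>>read these come from readableFileNames"
fileNames := #('Example-xx.1.mcz' 'Example-xx.2.mcz' 'Example-yy.1.mcz').
processed := OrderedCollection new.
fileNames
	do: [ :each | processed add: each ]
	separatedBy: [ (Delay forMilliseconds: 200) wait ].
"All files were handled, with a pause between consecutive ones only"
processed size.
```

In MCProject>>read the same do:separatedBy: pattern would wrap `self read: each`. A fixed delay is the simplest choice, not necessarily the right one for a busy server.
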
For each unique package in those MCVersions I add an MCPackageInfo,
defined as:

Object subclass: #MCPackageInfo
    instanceVariableNames: 'packageName packageVersions'
    classVariableNames: ''
    package: 'MonticelloProjects'

MCProject>>parse: aVersion
    | info |
    info := self ensureInfo: aVersion package.
    info addVersion: aVersion in: self

An MCPackageVersion then stores the version info and, for each
definition, a reference to the single copy kept in the project, which
eliminates the duplicates.

MCPackageInfo>>addVersion: aMcVersion in: aProject
    | packageVersion |
    packageVersion := MCPackageVersion new
        info: aMcVersion info;
        yourself.
    self packageVersions add: packageVersion.
    aMcVersion snapshot definitions do: [ :aDefinition |
        packageVersion definitions add:
            (aProject definitions add: aDefinition) ]

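The deduplication in addVersion:in: only works if adding to the project-wide set answers the equal element it already holds, which is what the code assumes of OCLiteralSet>>add:. A plain Set>>add: answers its argument instead. As a minimal sketch of the same idea using only base classes, a Dictionary can play the role of the canonical store. The variable names are made up for illustration, and strings stand in for MCDefinitions:

```smalltalk
| canonical first second storedFirst storedSecond |
"canonical maps each definition to the one copy kept for the whole project"
canonical := Dictionary new.
"Two equal but non-identical objects, standing in for the same
 definition read from two different mcz files"
first := 'Object subclass: #Foo' copy.
second := 'Object subclass: #Foo' copy.
storedFirst := canonical at: first ifAbsentPut: [ first ].
storedSecond := canonical at: second ifAbsentPut: [ second ].
"Both lookups answer the very same object, so only one copy is kept"
storedFirst == storedSecond.
```

The lookup is driven entirely by #= and #hash, so two definitions that compare equal collapse into one stored object.
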
As long as #= and #hash are correctly defined for all MCDefinitions,
this should make it possible to eliminate all duplicate definitions and
still have the full history. On my Documentation repo this already saves
a factor of 7 when the result is saved compressed as a Fuel file. On
large repositories with a high change rate (Roassal2?) the compression
will be significantly higher. There are several other normalizations
that can reduce the size further:
- make recategorization explicit
- normalize MCVersionInfo data: explicit author, compact timestamp.
I'd be interested in further ideas for this, and situations where this
approach wouldn't work.
Stephan