The format we currently use for storing source code is not optimal for archival and analysis. Each mcz stores the complete source of its package, so a repository holds many near-identical copies, which makes analysis across versions difficult and slow. Today I experimented with an archive format that combines many mcz files and should be able to reconstruct each individual one.

I defined an MCProject, representing a project repository. The definitions are stored in an OCLiteralSet.

Object subclass: #MCProject
        instanceVariableNames: 'location infos definitions repository'
        classVariableNames: ''
        package: 'MonticelloProjects'

For each filename found in the repository I load the MCVersion and its snapshot.

MCProject>>read
        | filenames |
        repository := MCHttpRepository location: location user: '' password: ''.
        filenames := repository readableFileNames.
        filenames do: [ :each | self read: each ]
        
MCProject>>read: aFileName
        "Needs a rate limiter!!!"
        |mcVersion|
        mcVersion := repository loadNotCachedVersionFromFileNamed: aFileName.
        mcVersion snapshot.
        self parse: mcVersion.
        repository flushCache
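
The rate limiter mentioned in the comment could be as simple as a fixed pause between downloads. A minimal sketch of a variant of #read, assuming that half a second (an arbitrary value, not something I have tuned) is polite enough for the server:

MCProject>>read
        | filenames |
        repository := MCHttpRepository location: location user: '' password: ''.
        filenames := repository readableFileNames.
        "Assumption: a fixed 500 ms pause between files; no pause after the last one"
        filenames
                do: [ :each | self read: each ]
                separatedBy: [ (Delay forMilliseconds: 500) wait ]

A fixed delay ignores how long each download took; something adaptive might be better for large repositories.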



For each unique package in those MCVersions, I add an MCPackageInfo, defined as

Object subclass: #MCPackageInfo
        instanceVariableNames: 'packageName packageVersions'
        classVariableNames: ''
        package: 'MonticelloProjects'

MCProject>>parse: aVersion
        |info|
        info := self ensureInfo: aVersion package.
        info addVersion: aVersion in: self

An MCPackageVersion then stores the version info and, for each definition in the snapshot,
the unique instance that is stored in the project, eliminating the duplicates.

MCPackageInfo>>addVersion: aMcVersion in: aProject
        |packageVersion|
        packageVersion := MCPackageVersion new
                info: aMcVersion info;
                yourself.
        self packageVersions add: packageVersion.
        aMcVersion snapshot definitions do: [ :aDefinition |
                        packageVersion definitions add:
                                (aProject definitions add: aDefinition) ].
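
The deduplication relies on add: answering the instance that is already in the project's set when an equal definition is present. The same idea can be illustrated in a playground with a plain Dictionary; the strings below merely stand in for definitions, so this is a sketch of the principle rather than the code above:

        | canonical a b first second |
        canonical := Dictionary new.
        a := 'foo ^ 1' copy.
        b := 'foo ^ 1' copy.   "equal to a, but a distinct object"
        "Each lookup answers the instance stored first for an equal key"
        first := canonical at: a ifAbsentPut: [ a ].
        second := canonical at: b ifAbsentPut: [ b ].
        first == second   "true: both refer to the single stored instance"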

As long as #= and #hash are correctly defined for all MCDefinitions, this should make it possible to eliminate all duplicate definitions and still have a full history. On my Documentation repo this already reduces the size by a factor of 7 when the result is saved compressed as a Fuel file. On large repositories with a high change rate (Roassal2?) I expect the reduction to be significantly higher. There are several other normalizations that can reduce the size further:
- make recategorization explicit
- normalize MCVersionInfo data: explicit author, compact timestamp
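
Reconstructing an individual version from the archive might then look roughly like this. It is a sketch only: #asSnapshot and #asVersionNamed: are hypothetical helpers that I have not written, and dependencies are left out because the structure above does not store them.

MCPackageVersion>>asSnapshot
        "Hypothetical helper: rebuild a snapshot from the shared definitions"
        ^ MCSnapshot fromDefinitions: self definitions asArray

MCPackageVersion>>asVersionNamed: aPackageName
        "Hypothetical helper: rebuild an MCVersion; dependencies are not restored"
        ^ MCVersion
                package: (MCPackage named: aPackageName)
                info: self info
                snapshot: self asSnapshot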

I'd be interested in further ideas for this, and situations where this approach wouldn't work.

Stephan

