On Tuesday 25 October 2005 09:23, Richard Freeman wrote:
> John Myers wrote:
> > I designed a system where it took feedback from consenting users, sending
> > the file lists back to my server, where I was going to do some data
> > crunching. The data from just _my_ system was over 60 MB.
>
> It sounds like you really only need to index each package a few times at
> most. Sure, the raw data from a user could be 60MB each, but there are
> some ways to reduce that significantly:

Hm. I forgot to mention that the largest pieces (the file names and the
md5sums) are only stored once, and then referenced with a relatively small
integer (compared to the size of, say, a file name).
Here's how it breaks down:

table                 | rows    | size
----------------------+---------+--------
ebuilds               |     994 | 118.3K
filenames             | 381,200 |  27.1M
file info             | 383,168 |  19.9M
installations list    |   1,007 |  26.7K
extra install data    |   1,007 |  88.2K
file->install mapping | 464,193 |  13.1M

There are some reinstallations and upgrades in the above data.
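
A minimal sketch of the "store once, reference by integer" layout described above. It uses Python's sqlite3 module as a stand-in for the MySQL database, and the table names, column names, and helper functions are all hypothetical, not the actual schema:

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE filenames (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL      -- each path is stored exactly once
        );
        CREATE TABLE file_install (
            filename_id INTEGER NOT NULL REFERENCES filenames(id),
            install_id  INTEGER NOT NULL   -- small integers instead of full paths
        );
    """)
    return db

def intern_filename(db, name):
    """Return the integer id for `name`, inserting it only if it is new."""
    db.execute("INSERT OR IGNORE INTO filenames (name) VALUES (?)", (name,))
    row = db.execute("SELECT id FROM filenames WHERE name = ?", (name,)).fetchone()
    return row[0]

def record_install(db, install_id, names):
    """Map every file of one installation to its interned filename id."""
    for name in names:
        db.execute(
            "INSERT INTO file_install (filename_id, install_id) VALUES (?, ?)",
            (intern_filename(db, name), install_id),
        )
```

With this layout, a reinstallation or upgrade that ships the same paths adds rows only to the mapping table, which would fit the mapping having more rows than the filenames table.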
> 1. Don't send in data for anything in the base system install.
>
> 2. As you populate your database, publish a list of indexed packages
> via a URL. Users would exclude any packages you've already indexed. If
> this were a GLEP you could probably put the file in the portage
> directory and everybody would get it via rsync.
>
> 3. Start by only indexing each package ONCE. Don't worry about every
> combo of arches, CFLAGS, USE, etc. That means that most users wouldn't
> upload anything at all, and the rest would only send their unique
> contributions.
Interesting thoughts.
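
Suggestions 1 to 3 amount to a set difference on the client side. A rough sketch, assuming the client already has the base-system list and the published index as plain collections of package names (the function name and the example package lists are purely illustrative):

```python
def packages_to_upload(installed, base_system, already_indexed):
    """Return the installed packages the server has not seen yet.

    installed       -- package names found on the user's system
    base_system     -- packages excluded up front (suggestion 1)
    already_indexed -- the published list of indexed packages (suggestion 2)
    """
    skip = set(base_system) | set(already_indexed)
    # Only the user's unique contributions are left (suggestion 3).
    return sorted(pkg for pkg in installed if pkg not in skip)


# Illustrative input only; these lists are not real survey data.
installed = ["sys-libs/glibc", "app-editors/vim", "net-misc/curl"]
to_send = packages_to_upload(
    installed,
    base_system=["sys-libs/glibc"],
    already_indexed=["app-editors/vim"],
)
```

For most users the result would be empty, which is the point of the suggestion: nothing gets uploaded unless it is new to the index.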
> If you get everything working without indexing by USE, you could start
> adding that capability in. Publish in #2 the list of USE flags indexed
> for each package, and individuals would only upload packages compiled
> with something that wasn't on that list.
>
> Sure, the final database could easily be 100MB or so, but if you just
> put it on a website you won't be sending the whole thing. Just put it
> in mysql/postgres and build a php front end (sorry, not a web dev
> personally, but it isn't that hard to do from the little I've messed
> with it).
That's what the intention was, maybe with an XML-RPC service for a
command-line client to use. The data is stored in a MySQL database.
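
One possible shape for such an XML-RPC service, sketched with Python's standard xmlrpc.server module. The method name `find_owners`, the port, and the in-memory dictionary standing in for the MySQL tables are assumptions for illustration, not the actual design:

```python
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for the database: file path -> owning packages.
FILE_OWNERS = {
    "/usr/bin/vim": ["app-editors/vim"],
}

def find_owners(path):
    """Return the packages recorded as installing `path` (empty list if none)."""
    return FILE_OWNERS.get(path, [])

def build_server(host="127.0.0.1", port=8000):
    """Create (but do not start) an XML-RPC server exposing find_owners."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(find_owners, "find_owners")
    return server
```

The server side would then call `build_server().serve_forever()`, and a command-line client could query it with `xmlrpc.client.ServerProxy("http://127.0.0.1:8000/").find_owners("/usr/bin/vim")`.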
>
> Sorry - I don't intend to make it sound like the whole thing can be done
> in 5 minutes, and I'm sure you've already poured hours into your effort.
> However, I don't see any theoretical issues with it as long as the
> design is right. The important thing is that users are only uploading
> diffs against your master repository - and not doing a complete dump of
> their entire system. Otherwise you will get buried in data!
The biggest problem is that there are a lot of potential variations, and they
all really need to be there for this to be useful.
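
To give a rough sense of scale, here is a back-of-the-envelope upper bound, assuming every USE flag of a package can be toggled independently and ignoring CFLAGS and versions entirely (the function is illustrative only):

```python
def variation_count(num_use_flags, num_arches=1):
    """Upper bound on builds of one package: each USE flag is on or off."""
    return num_arches * 2 ** num_use_flags
```

Under those assumptions, a package with 10 USE flags on 2 arches already allows `variation_count(10, 2)` = 2048 combinations, though in practice many of them produce identical file lists.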
>
> I must admit that it is easy to just talk about ideas like this - I
> really do want to commend you on the work you've undoubtedly already
> accomplished! OSS projects require lots of hard work by many volunteers
> and it is all too easy for people like me to just sit back and nitpick
> what could be done better...
Well, I think I might hack around on this a little more.
