-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Myers wrote:
> 
> I designed a system where it took feedback from consenting users, sending the 
> file lists back to my server, where I was going to do some data crunching. The 
> data from just _my_ system was over 60 MB.

It sounds like you really only need to index each package a few times at
most.  Sure, the raw data could be 60MB per user, but there are some
ways to reduce that significantly:

1.  Don't send in data for anything in the base system install.

2.  As you populate your database, publish a list of indexed packages
via a URL.  Users would exclude any packages you've already indexed.  If
this were a GLEP you could probably put the file in the portage
directory and everybody would get it via rsync.

3.  Start by only indexing each package ONCE.  Don't worry about every
combo of arches, CFLAGS, USE, etc.  That means that most users wouldn't
upload anything at all, and the rest would only send their unique
contributions.
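
A rough Python sketch of the client-side filtering in points 1 to 3.
The function name, the example package names, and the data layout (a
dict mapping each package to its file list) are made up for
illustration; none of this is an existing portage interface.

```python
def packages_to_upload(installed, base_system, already_indexed):
    """Return only the packages the server has not seen yet.

    installed:       dict mapping package name -> list of installed files
    base_system:     set of package names in the base install (point 1)
    already_indexed: set of package names published by the server (point 2)
    """
    skip = base_system | already_indexed
    return {pkg: files for pkg, files in installed.items() if pkg not in skip}


# Hypothetical example data, not taken from a real system.
installed = {
    "sys-apps/coreutils": ["/bin/ls", "/bin/cp"],
    "app-editors/vim": ["/usr/bin/vim"],
    "games-misc/fortune-mod": ["/usr/bin/fortune"],
}
base_system = {"sys-apps/coreutils"}
already_indexed = {"app-editors/vim"}

upload = packages_to_upload(installed, base_system, already_indexed)
print(sorted(upload))
```

Once the published list covers most of the tree, the result is empty
for most users and nothing is sent at all.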

If you get everything working without indexing by USE, you could start
adding that capability later.  Extend the list from #2 to include the
USE flags already indexed for each package, and individuals would only
upload packages compiled with a flag that wasn't on that list.
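
The USE check could look something like this in Python.  It assumes
the published list maps each package to the set of USE flags already
indexed; that layout and the name needs_upload are my own invention
for the example.

```python
def needs_upload(pkg, use_flags, indexed_use):
    """Decide whether a locally built package adds anything new.

    pkg:         package name
    use_flags:   iterable of USE flags the local build was compiled with
    indexed_use: dict mapping package name -> set of USE flags already
                 indexed on the server (assumed format)
    """
    known = indexed_use.get(pkg)
    if known is None:
        # The package has never been indexed, so any build is new.
        return True
    # Upload only if at least one local flag is missing from the list.
    return bool(set(use_flags) - known)


# Hypothetical published list.
indexed_use = {"app-editors/vim": {"perl", "python"}}

print(needs_upload("app-editors/vim", ["perl"], indexed_use))
print(needs_upload("app-editors/vim", ["perl", "ruby"], indexed_use))
print(needs_upload("net-irc/irssi", [], indexed_use))
```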

Sure, the final database could easily be 100MB or so, but if you just
put it on a website nobody has to download the whole thing.  Just put it
in MySQL/PostgreSQL and build a PHP front end (sorry, not a web dev
personally, but it isn't that hard to do from the little I've messed
with it).
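
To give an idea of how small the server side could be, here is a
sketch using Python's built-in sqlite3 module as a stand-in for
MySQL/PostgreSQL.  The table layout, the function names, and the
assumption that the main query is "which package owns this file" are
all guesses on my part.

```python
import sqlite3

# In-memory database for the example; a real server would use a file
# or a MySQL/PostgreSQL instance instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " package TEXT NOT NULL,"
    " path TEXT NOT NULL,"
    " PRIMARY KEY (package, path))"
)
# Index on path so lookups by file name do not scan the whole table.
conn.execute("CREATE INDEX files_by_path ON files (path)")


def add_package(conn, package, paths):
    """Store a package's file list; repeated uploads are ignored."""
    conn.executemany(
        "INSERT OR IGNORE INTO files (package, path) VALUES (?, ?)",
        [(package, p) for p in paths],
    )


def owners(conn, path):
    """Return the names of all indexed packages that install this path."""
    rows = conn.execute(
        "SELECT package FROM files WHERE path = ? ORDER BY package", (path,)
    )
    return [row[0] for row in rows]


# Hypothetical data; the second call simulates a duplicate upload.
add_package(conn, "app-editors/vim", ["/usr/bin/vim", "/usr/bin/vimdiff"])
add_package(conn, "app-editors/vim", ["/usr/bin/vim"])
print(owners(conn, "/usr/bin/vim"))
```

A web front end would only ever run queries like owners(), so each
visitor pulls a few rows rather than the full database.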

Sorry - I don't intend to make it sound like the whole thing can be done
in 5 minutes, and I'm sure you've already poured hours into your effort.
However, I don't see any theoretical issues with it as long as the
design is right.  The important thing is that users are only uploading
diffs against your master repository - and not doing a complete dump of
their entire system.  Otherwise you will get buried in data!

I must admit that it is easy to just talk about ideas like this - I
really do want to commend you on the work you've undoubtedly already
accomplished!  OSS projects require lots of hard work by many volunteers
and it is all too easy for people like me to just sit back and nitpick
what could be done better...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFDXlvkg2bN8aFizRkRArU+AKCnEBdpoO2Acnwh3+FFR8CYj5CLtACcCboB
2QIb31yXVdW0EQST8PEUPeY=
=VF5P
-----END PGP SIGNATURE-----
