Btw, one way of doing the index update would be:

emerge --sync:

   - Delete index if it exists (so the next search rebuilds it fresh)

emerge --searchindexon

   - Turns index on

emerge --searchindexoff

   - Turns index off
   - Deletes index if it exists

emerge -s or emerge -S

   - Does a normal search if the index is off
   - If the index is on but does not exist, creates the index on the fly
   - If the index is on and does exist, uses it
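That dispatch could be sketched roughly like this (everything here is invented for illustration - INDEX_PATH, create_index and both search helpers are stand-ins, not real portage or esearch API):

```python
import os
import tempfile

# Hypothetical index location; a real one would live under /var/cache or similar.
INDEX_PATH = os.path.join(tempfile.gettempdir(), "searchindex.txt")

def create_index(path):
    # A real implementation would walk the portage tree here.
    with open(path, "w") as f:
        f.write("app-editors/vim\n")

def search_with_index(spec, path):
    with open(path) as f:
        return [line.strip() for line in f if spec in line]

def search_without_index(spec):
    # Stand-in for the existing linear scan over the tree.
    return [p for p in ["app-editors/vim"] if spec in p]

def search(spec, index_enabled):
    if not index_enabled:
        return search_without_index(spec)       # index off: normal search
    if not os.path.exists(INDEX_PATH):
        create_index(INDEX_PATH)                # on but missing: build on the fly
    return search_with_index(spec, INDEX_PATH)  # on and present: use it
```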

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.


2008/12/2 Tambet <[EMAIL PROTECTED]>

> About zipping: default settings might not really be a good idea - I think
> that "fastest" might be even better. Considering that the portage tree
> contains the same word again and again (like "applications"), it needs a
> pretty small dictionary to become much smaller. Decompressing will not be
> reading from disk, decompressing and writing back to disk as in your case -
> try decompressing to a memory drive and you might get better numbers.
>
> I have personally used compression in one C++ application, and with optimum
> settings it made things much faster. Those were files where I had, for
> example, 65536 16-byte integers which could be zeros and mostly were; I
> didn't care about creating a better file format, but just compressed the
> whole thing.
>
> I suggest you compress the esearch db, then decompress it to a memory drive
> and give us those numbers - it might be considerably faster.
>
> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip
> support. Try gzip.open versus normal open on the esearch db; also compress
> with the same lib to get the right kind of file.
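As a concrete sketch of that suggestion (file names and sample data invented; the real esearch db would take the place of `plain`): gzip.open reads and writes transparently just like the built-in open, and compresslevel=1 is the "fastest" setting mentioned above.

```python
import gzip
import os
import tempfile

# Repetitive sample data, like the portage tree repeating "applications".
data = b"applications applications applications\n" * 1000

d = tempfile.mkdtemp()
plain = os.path.join(d, "esearchdb")  # invented stand-in path
packed = plain + ".gz"

with open(plain, "wb") as f:
    f.write(data)

# compresslevel=1 is gzip's "fastest" mode suggested above.
with gzip.open(packed, "wb", compresslevel=1) as f:
    f.write(data)

# gzip.open reads transparently, so code using open() barely changes.
with gzip.open(packed, "rb") as f:
    assert f.read() == data

print(os.path.getsize(plain), "->", os.path.getsize(packed))
```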
>
> Anyway - maybe this compression should be added later, and be optional.
>
> Tambet - technique evolves to art, art evolves to magic, magic evolves to
> just doing.
>
>
> 2008/12/2 Alec Warner <[EMAIL PROTECTED]>
>
> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <[EMAIL PROTECTED]> wrote:
>> > 2008/12/2 Emma Strubell <[EMAIL PROTECTED]>
>> >>
>> >> True, true. Like I said, I don't really use overlays, so excuse my
>> >> ignorance.
>> >
>> > Do you know the order of doing things:
>> >
>> > Rules of Optimization:
>> >
>> > Rule 1: Don't do it.
>> > Rule 2 (for experts only): Don't do it yet.
>> >
>> > What this actually means: functionality comes first. Readability comes
>> > next. Optimization comes last. Unless you are creating a fancy 3D engine
>> > for a kung fu game.
>> >
>> > If you are going to exclude overlays, you are removing functionality -
>> > and, indeed, absolutely has-to-be-there functionality, because no one
>> > would intuitively expect the search function to search only one subset of
>> > packages, however reasonable this subset might be. So you can't, just
>> > can't, add this package into the portage base - you could write just
>> > another external search package for portage.
>> >
>> > I looked at this code a bit:
>> > Portage's "__init__.py" contains the comment "# search functionality".
>> > After this comment, there is a nice and simple search class.
>> > It also contains the method "def action_sync(...)", which holds the
>> > synchronization stuff.
>> >
>> > Now, the search class is initialized by setting up 3 databases -
>> > porttree, bintree and vartree, whatever those are. Those will be in the
>> > self._dbs array, and porttree will be in self._portdb.
>> >
>> > It contains some more methods:
>> > _findname(...) returns the result of self._portdb.findname(...) with the
>> > same parameters, or None if it does not exist.
>> > Other methods do similar things - each maps one method or another.
>> > execute does the real search...
>> > Now, "for package in self.portdb.cp_all()" is important here... it
>> > currently loops over the whole portage tree. All kinds of matching are
>> > done inside.
>> > self.portdb obviously points to porttree.py (unless it points to a fake
>> > tree).
>> > cp_all takes all porttrees and does a simple file search inside. This
>> > method should contain the optional index search.
>> >
>> >               self.porttrees = [self.porttree_root] + \
>> >                       [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]
>> >
>> > So, self.porttrees contains the list of trees - the first of them is the
>> > root, the others are overlays.
>> >
>> > Now, what you have to do will not be harder just because of having
>> > overlay search, too.
>> >
>> > You have to create a method def cp_index(self), which returns a
>> > dictionary containing package names as keys. For oroot... it will be
>> > "self.porttrees[1:]", not "self.porttrees" - this will search only
>> > overlays. d = {} will be replaced with d = self.cp_index(). If the index
>> > is not there, the old version will be used (thus, you have to make an
>> > internal porttrees variable, which contains all trees, or all except the
>> > first).
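A minimal sketch of that cp_index idea (class name, file layout and the stubbed directory walk are all invented; the real cp_all in porttree.py is more involved):

```python
import os
import pickle
import tempfile

class PortTreeSketch:
    """Toy model of the cp_all/cp_index fallback described above."""

    def __init__(self, index_path=None):
        self.index_path = index_path

    def cp_index(self):
        # Return {category/package: ...} from the index, or None if absent.
        if self.index_path and os.path.exists(self.index_path):
            with open(self.index_path, "rb") as f:
                return pickle.load(f)
        return None

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            # Old behaviour: walk the tree directories (stubbed here).
            d = {"app-editors/vim": None}
        return sorted(d)

# With no index, cp_all falls back to the (stubbed) directory walk.
print(PortTreeSketch().cp_all())

# With an index file present, cp_all uses it instead.
idx = os.path.join(tempfile.mkdtemp(), "index.pickle")
with open(idx, "wb") as f:
    pickle.dump({"app-editors/nano": None, "app-editors/vim": None}, f)
print(PortTreeSketch(idx).cp_all())
```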
>> >
>> > The other methods used by search are xmatch and aux_get - the first is
>> > used several times and the last is used to get the description. You have
>> > to cache the results of those specific queries and make them use your
>> > cache - as you can see, those parts of portage are already able to use
>> > overlays. Thus, you have to put your code again at the beginning of those
>> > functions - create index_xmatch and index_aux_get methods, then make
>> > those methods use them and return their results unless those are None (or
>> > something else in case None is already a legal result) - if they return
>> > None, the old code will run and do its job. If the index is not created,
>> > the result is None. In the index_** methods, just check whether the query
>> > is one you can answer, and if it is, answer it.
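The answer-or-return-None pattern described here might look like this (the cache shape and these free-standing function names are invented; the real aux_get lives on portage's dbapi classes):

```python
# Invented cache shape: {cpv: {metadata_key: value}}.
def index_aux_get(cache, cpv, keys):
    if cache is None:
        return None                       # no index built: caller falls back
    try:
        return [cache[cpv][k] for k in keys]
    except KeyError:
        return None                       # query we cannot answer from index

def slow_aux_get(cpv, keys):
    # Stand-in for the original, slower tree lookup.
    return ["(read from tree)" for _ in keys]

def aux_get(cache, cpv, keys):
    result = index_aux_get(cache, cpv, keys)
    if result is not None:
        return result                     # answered from the index
    return slow_aux_get(cpv, keys)        # old code path does its job

cache = {"app-editors/vim-7.2": {"DESCRIPTION": "Vim, an improved vi"}}
print(aux_get(cache, "app-editors/vim-7.2", ["DESCRIPTION"]))
print(aux_get(cache, "app-misc/foo-1.0", ["DESCRIPTION"]))
print(aux_get(None, "app-editors/vim-7.2", ["DESCRIPTION"]))
```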
>> >
>> > Obviously, the simplest way to create your index is to delete the index,
>> > then use those same methods to query for all the necessary information -
>> > and the fastest way would be to add updating the index directly into
>> > sync, which you could do later.
>> >
>> > Please also add commands to turn the index on and off (the latter should
>> > also delete it to save disk space). The default should be off until it's
>> > fast, small and reliable. Also notice that if the index is kept on the
>> > hard drive, it might be faster if it's compressed (gz, for example) -
>> > decompressing takes less time, and more processing power, than reading it
>> > fully out.
>>
>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.
>>
>> A disk read has 2 primary costs.
>>
>> Seek Time: Time for the head to seek to the sector of disk you want.
>> Spin Time: Time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm: on average, 7200 rpm / 60 seconds = 120
>> rotations per second, so in the worst case (you just passed the sector
>> you need) you wait 1/120th of a second (about 8ms).
>>
>> Seek Time is per hard drive, but most drives provide average seek
>> times under 10ms.
>>
>> So it takes on average 18ms to get to your data, then you start
>> reading.  The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).
>>
>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs.
>> gzip -d on the database takes 300ms; catting the decompressed database
>> takes 88ms (average of 5 runs, dropping disk caches between runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3megs compresses to 390k.
>>
>> 36ms to decompress the 390k file, but 26ms to read the 1.3meg file from
>> disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from file
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p
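Anyone can rerun a rough version of this comparison. Note that this sketch does not drop the kernel's disk caches, so it mostly measures the CPU side of the trade-off; the sample data is a made-up stand-in for vdb_metadata.pickle, and the numbers will vary by machine:

```python
import gzip
import os
import tempfile
import time

# ~1.3 MB of compressible bytes, a stand-in for vdb_metadata.pickle.
data = b"sys-apps/portage metadata line\n" * 45000

d = tempfile.mkdtemp()
plain = os.path.join(d, "db")
packed = os.path.join(d, "db.gz")
with open(plain, "wb") as f:
    f.write(data)
with gzip.open(packed, "wb") as f:
    f.write(data)

t0 = time.perf_counter()
with open(plain, "rb") as f:
    raw = f.read()
t_read = time.perf_counter() - t0

t0 = time.perf_counter()
with gzip.open(packed, "rb") as f:
    unpacked = f.read()
t_gunzip = time.perf_counter() - t0

print(f"plain read: {t_read * 1000:.2f} ms, "
      f"gunzip read: {t_gunzip * 1000:.2f} ms, "
      f"sizes: {os.path.getsize(plain)} vs {os.path.getsize(packed)} bytes")
```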
>>
>> >
>> > Have luck!
>> >
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA1
>> >>>
>> >>> Emma Strubell schrieb:
>> >>> > 2) does anyone really need to search an overlay anyway?
>> >>>
>> >>> Of course. Take large (semi-)official overlays like sunrise. They can
>> >>> easily be seen as a second portage tree.
>> >>> -----BEGIN PGP SIGNATURE-----
>> >>> Version: GnuPG v2.0.9 (GNU/Linux)
>> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>> >>>
>> >>> iEYEARECAAYFAkk0YpEACgkQ4UOg/zhYFuD3jQCdG/ChDmyOncpgUKeMuqDxD1Tt
>> >>> 0mwAn2FXskdEAyFlmE8shUJy7WlhHr4S
>> >>> =+lCO
>> >>> -----END PGP SIGNATURE-----
>> >>>
>> >> On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >
>> >
>>
>
>
