Thanks for getting back to me so fast.

The updated webrev is at:

http://cr.opensolaris.org/~bpytlik/ips-search/



Comments in line:

[EMAIL PROTECTED] wrote:
> Hey Brock,
>
> I took a look at your webrev.  You've written enough new code that I'm
> probably not able to review it as thoroughly as I can with the existing
> code.  It would be great if you could write up a description of your
> design, and how the different parts work.  That may give some of us who
> aren't experts in search issues a chance to provide overall feedback on
> the design.
>
>   
From a high level, there are two components to this change: the 
indexer, which maintains the information needed for search, and the 
query engine, which actually performs a search of the information 
provided. The indexer is responsible for creating and updating the 
indexes and ensuring they're always in a consistent state. It does this 
by maintaining a set of inverted indexes as text files (details of which 
can be found in the comments at the top of indexer.py). On the server 
side, it's hooked into the publishing code so that the index is updated 
each time a package is published. If indexing is already happening when 
packages are published, they're queued and another update to the indexes 
happens once the current run is finished. On the client side, it's 
hooked into the install, image-update, and uninstall code so that each 
of those actions is reflected in the index.
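
To make the text-file/queueing behavior concrete, here's a minimal 
sketch in the spirit of indexer.py (all names here are invented for 
illustration, not the actual API):

```python
# Hypothetical sketch of the indexer's approach: an inverted index
# (token -> packages) that can be dumped as a text file, with publishes
# that arrive mid-run queued for a follow-up indexing pass.
import threading

class TinyIndexer:
    def __init__(self):
        self.index = {}          # token -> set of package names
        self.pending = []        # publishes queued during an active run
        self.indexing = False
        self.lock = threading.Lock()

    def publish(self, pkg, tokens):
        with self.lock:
            if self.indexing:
                # An index run is underway; queue for the next pass.
                self.pending.append((pkg, tokens))
                return
            self.indexing = True
        self._run([(pkg, tokens)])

    def _run(self, batch):
        while batch:
            for pkg, tokens in batch:
                for tok in tokens:
                    self.index.setdefault(tok, set()).add(pkg)
            with self.lock:
                # Pick up anything published while we were indexing.
                batch, self.pending = self.pending, []
                if not batch:
                    self.indexing = False

    def dump(self):
        # One "token pkg1 pkg2 ..." line per token, like a text index.
        return ["%s %s" % (t, " ".join(sorted(p)))
                for t, p in sorted(self.index.items())]
```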

The query engine is responsible for taking the token from the user, 
searching for it in its information, and giving the client code 
the information needed for a reasonable response to the user. It must 
ensure that the information it uses is in a consistent state. On the 
server, an engine is created during the server initialization. It reads 
in the files it needs and stores the data internally (this is why the 
server RAM footprint increased). When the server gets a search request 
from a client, it hands the search token to the query engine. The query 
engine ensures that it has the most recent information (locking and 
rereading the files from disk if necessary) and then searches for the 
token in its dictionaries. On the client, the process is the same except 
that the indexes are read from disk each time instead of being stored 
(since a new instance of pkg is run for each search).
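
As a rough illustration of that server-side pattern (in-memory index, 
reread under a lock only when the on-disk copy is newer), with all 
names invented for the example:

```python
# Illustrative sketch, not the real query-engine code: the index is held
# in memory and reread under a lock only when the file's mtime shows it
# changed since the last load.
import os
import threading

class QueryEngine:
    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        self.loaded_mtime = -1.0
        self.index = {}
        self._reload()

    def _reload(self):
        with self.lock:
            mtime = os.path.getmtime(self.path)
            if mtime <= self.loaded_mtime:
                return      # another thread already reread it
            index = {}
            with open(self.path) as f:
                for line in f:
                    tok, pkgs = line.split(None, 1)
                    index[tok] = pkgs.split()
            self.index, self.loaded_mtime = index, mtime

    def search(self, token):
        # Refresh first if the index file changed since the last load.
        if os.path.getmtime(self.path) > self.loaded_mtime:
            self._reload()
        return self.index.get(token, [])
```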

Was that the kind of information you were looking for?
>> Here's the webrev for the new search backend:
>> http://cr.opensolaris.org/~bpytlik/ips-search/
>>     
>
> Here's some specific feedback that I've come up with so far:
>
> client.py:
>
>   - line 399-416, 500-517, 1432-1451: You've duplicated a lot of the
>     error handling code.  Would it make sense to commonize this in the
>     top-level exception handlers that we use to wrap main_func()?
>   
The exception handling in 399-416 and 500-517 has to be there. It 
critically prevents the exception from reaching the handlers on lines 
417 and 518, which conclude that the installation has failed and 
restore the boot environment. Indexing used to be the last thing the 
client did before returning, so the boot environment should still have 
been the new image; simply reindexing is much faster than redownloading 
all the packages. I've changed imageplan so that even if the indexing 
fails, the pkg_plans still go through their post-execution steps, so 
the image should, imo, still be considered bootable and consistent.
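
A toy version of that layering, with made-up handler bodies, just to 
show why the inner except must sit below the boot-environment handler:

```python
# Toy illustration of the layering described above (the handler bodies
# are invented): the inner except keeps an indexing failure from
# reaching the outer handler, which would otherwise roll back the BE.
class IndexingError(Exception):
    pass

log = []

def install(fail_indexing):
    try:
        try:
            # ... packages installed successfully at this point ...
            if fail_indexing:
                raise IndexingError("index update failed")
        except IndexingError:
            # Inner handler: warn, but the image is still consistent.
            log.append("warn: run pkg rebuild-index")
    except Exception:
        # Outer handler: a real install failure; restore the old BE.
        log.append("restoring boot environment")

install(True)
```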

1432-1451 could be pulled out to the top level, but it wouldn't be a 
case of making code common, merely of relocating it (and catching the 
same exceptions at two different levels might well encourage the 
accidental removal of the lower-level ones). I have removed my catch of 
runtime errors and moved it to the top level so that it's handled 
consistently with other runtime errors.
>     Meta-question here too: Some of these errors instruct the user to
>     remove directories and then run, or re-run commands.  Would it make
>     more sense to:
>
>     a) Provide a command that repairs the indexes without any further
>     user intervetion
>   
Yes, I'll make this fix.
>     b) If yes to a, then can we just run the repair when we encounter an
>     error instead of returning a message?
>   
I think notification is the right way to go here. The permissions 
error couldn't be handled internally, as the user would need to reindex 
with sufficient permissions to change the directory. The other two 
errors can only happen if either indexing crashed during a previous run 
or indexing is happening at the same time. For example, imagine someone 
starts rebuilding the index for an image in one shell and then 
manipulates that image in another shell. I don't have a good way of 
distinguishing a run that's still happening from one that broke. I could 
simply remove the directory (possibly out from under the running 
indexer) and start indexing again, but that made me feel a bit queasy. 
If that's the preferred solution, I'll implement it.
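
For what it's worth, one possible way to make that distinction (not 
something the current code does) would be a pid-recording lockfile:

```python
# Hypothetical technique, not the current implementation: the indexer
# records its pid in a lockfile; a later run can then check whether
# that process still exists to tell a live run from a crashed one.
import errno
import os

def index_run_is_live(pidfile):
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, OSError, ValueError):
        return False            # no lockfile, or garbage in it
    try:
        os.kill(pid, 0)         # signal 0: existence check only
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False        # stale lockfile from a crashed run
    return True                 # process exists (or we lack permission)
```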
> depot.py:
>
>   - line 150: Can we use a lockfile insead of a scary warning here?
>   
I've changed the comment slightly to make it clearer that this argument 
should never be used by anyone other than the depot when it's reexecing 
itself.
> pkg.1.txt:
>
>   - I would omit "should be generally unnecessary, but"
>
>   
I've reworded things a bit. I just want to make it clear in the man page 
that it's not necessary to run rebuild-index unless something has broken.
> catalog.py:
>
>   - line 43: The catalog code is shared between the client and the
>     server.  Adding this import means that the client now depends upon
>     the server for correct functionality.  I don't think we want to do
>     this.
>
>   - line 147:  This suggests, perhaps, that we want a subclass of
>     Catalog for the server?
>   
I agree, I've moved server side search code out to a subclass. I won't 
do a full refactoring of catalog in this change as I think it would just 
confuse things more.
>   - line 208 & 209: Is there a reason why the tempfile needs to have the
>     same name for every invocation?  Would it be better to use
>     tempfile.mkstemp()?
>   
No, there's no reason. If tempfile.mkstemp is the preferred way of doing 
things, I'll switch to that. My only argument for using the other method 
is that then the file is in a known location when something breaks, so 
it might be easier to get debugging info from that file, rather than 
trying to figure out which file in /tmp is the relevant one. But, again, 
not something I feel strongly about.
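
If we do switch, something like this would keep the debuggability 
argument intact: mkstemp, but with a recognizable prefix in a known 
directory (the prefix and helper name here are illustrative):

```python
# Sketch of the mkstemp approach with the debuggability of a fixed name
# preserved: a recognizable prefix in a known directory, but a unique
# filename per invocation so concurrent writers can't collide.
import os
import tempfile

def write_search_tmp(data, directory):
    fd, path = tempfile.mkstemp(prefix="pkg-search.", dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    return path   # caller renames this over the real index file
```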
>   - line 220: Shouldn't we drop the lock here?
>   
I don't think so. If you dropped it here, two threads could both acquire 
writing file handles to the same file, or, if we moved to 
tempfile.mkstemp, then it's possible one would blow away the changes of 
the other. It's for the same reason that I hold the lock past the point 
where Catalog.save_pkg_names is called. It would be bad to have two 
threads both dumping their pickled content to the same file.

Perhaps there's something I'm not seeing here, if what I've said above 
doesn't make sense, let me know and we can talk through it some more.
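
A toy version of the invariant I'm arguing for, with illustrative 
names: the lock is held from the temp-file write through the rename, 
so two threads can never interleave their output:

```python
# Illustrative sketch: because the lock covers both the write and the
# rename, the final file is always one thread's complete output, never
# a mix of two threads' writes to the same temp file.
import os
import threading

class SafeSaver:
    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()

    def save(self, lines):
        with self.lock:                 # held across write *and* rename
            tmp = self.path + ".tmp"
            with open(tmp, "w") as f:
                f.write("\n".join(lines) + "\n")
            os.rename(tmp, self.path)   # atomic swap into place
```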
>   - line 249: The hold time on this lock looks long.  Who's contending
>     for this?
>   
Basically, contention on this lock scales linearly with the number of 
publishers publishing simultaneously to the depot. I didn't think this 
would be a problem, but if it is, I can take a look at redesigning it.
>   - line 349: Who releases searchdb_lock?
>   
searchdb_lock is released in the child_handler, at roughly line 471.
>     More importantly, what's the lock ordering for the update lock
>     and searchdb lock?  This should be documented.  You should also
>     document what each lock protects, preferably in the constructor, or
>     somewhere conspicuous.
>   
After looking carefully at the code, I believe the searchdb lock is 
redundant given the other code that's in place. For now I've removed it, 
but I could be convinced to put it back if it would help the code be 
less brittle. As designed now, the code never tries to acquire that lock 
except exactly in those situations where it can get the lock 
immediately, which suggests there's no need for the lock now. As long as 
any future code which tries to index does the same check at the top of 
refresh_index, there won't be any problems. I've commented this to make 
it clearer, but I'd be open to the idea that a symbolic/failsafe lock 
might be worthwhile.
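
The "only proceed if the lock is free right now" check looks roughly 
like this (refresh_index is the real method name; everything else here 
is illustrative):

```python
# Sketch of a non-blocking lock acquisition: if another thread is
# already indexing, give up immediately instead of queueing behind it.
import threading

index_lock = threading.Lock()

def refresh_index(rebuild):
    if not index_lock.acquire(False):   # non-blocking acquire
        return False                    # someone else is indexing
    try:
        rebuild()
        return True
    finally:
        index_lock.release()
```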
>   - line 376: Shouldn't we drop the update and searchdb locks here?    
>   
Yes
>   - line 377: Why don't we drop the updatelock for non-posix systems?
>
>   
We should; I've moved the location of the lock release to make sure the 
updatelock is released.
> indexer.py:
>
>   - line 105: It seems like it would make more sense for the Indexer to
>     be able to determine it's max memory use when it's created.  You've
>     duplicated the max memory calculation in image and imageplan.  I
>     would either have __init__ figure this out, or pass __init__ a
>     function that it can call to figure this out.
>
>   
Good point; I've moved the code around a bit.
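
The shape of that change, sketched with illustrative names:

```python
# Illustrative sketch of passing a policy callable to __init__ so the
# max-memory calculation lives in exactly one place instead of being
# duplicated in image and imageplan.
class Indexer:
    def __init__(self, get_memory_limit):
        # Defer the calculation to the caller-supplied function.
        self.max_memory = get_memory_limit()

def default_memory_limit():
    return 100 * 1024 * 1024   # fixed fallback: 100 MB (illustrative)

idx = Indexer(default_memory_limit)
```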

>  - line 658 / 669: Why do we need separate update index functions for
>    the server and the client?
>
>   
The short answer is that one needs to pass in PKGS(=TRUE) while the 
other passes in FMRIS(=FALSE). The longer answer is that the functions, 
while generally very similar, do have a few differences. They have 
different input data structures: the client provides a list of pkg 
plans while the server provides a list of FMRIs. To put it another way, 
the client has a way of specifying which packages to remove from the 
index, while the server doesn't have that capability. The client also 
has filters it needs to apply while the server doesn't. Rather than 
expose an interface that passes in a value to determine which option to 
use, I decided to use separate functions to encode the difference, as I 
found that to be a clearer interface.
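
Sketched with invented names (this is not the indexer.py code), the 
split looks like two thin entry points over one shared routine, rather 
than one function taking a boolean:

```python
# Illustrative sketch: the client/server difference is encoded in two
# thin wrappers around a shared internal update routine, instead of a
# boolean flag in the public interface.
def _update_index(adds, removes=(), filters=None):
    index = {"added": list(adds), "removed": list(removes)}
    if filters:
        index["added"] = [a for a in index["added"] if filters(a)]
    return index

def client_update_index(pkg_plans, filters):
    # Client: pkg plans can both add and remove, and filters apply.
    adds = [p["add"] for p in pkg_plans if "add" in p]
    removes = [p["remove"] for p in pkg_plans if "remove" in p]
    return _update_index(adds, removes, filters)

def server_update_index(fmris):
    # Server: only ever adds newly published FMRIs; no removes/filters.
    return _update_index(fmris)
```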

> misc.py:
>
>   - line 328,329: Shouldn't we determine these dynamically, or at a
>     minimum not let this number exceed the amount of memory in the
>     machine?
>
>   
Ok, thanks for the pointer to sysconf. It now tries to use sysconf, and 
if that blows up it falls back to the defaults I've given.
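
The sysconf-plus-fallback logic is roughly (the 100 MB default here is 
just illustrative, not the actual value in misc.py):

```python
# Sketch of detecting physical memory via sysconf, falling back to a
# fixed default when sysconf or the needed names are unavailable.
import os

def phys_memory_bytes(default=100 * 1024 * 1024):
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        # sysconf missing or names unsupported on this platform.
        return default
```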
> search_storage.py:
>
>   - lines 492-495: Remove dead code, please.
>
>
>   
oops
> Thanks,
>
> -j
>   
Other changes to the code:

  - Added logging on the server side to indicate when indexes became 
    available or were updated.

  - Added locking around the server side's rereading of dictionaries, 
    to ensure that multiple search threads won't use the internal 
    representations while they're only partially built. A better 
    solution may be for the indexer to update the internal structures 
    as well as the file structures. I will give some thought to that 
    for the next round of search updates.

_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
