Re: [pkg-discuss] Code Review request: 1949, 983, 2122

Brock Pytlik Wed, 16 Jul 2008 17:16:36 -0700

Hi Shawn, thanks taking a look.
If I've snipped a suggestion, that means "ok, I've made that change"


Comments in line.

General comment:

          563 +                fmri_dir_path = os.path.join(self.imgdir, "pkg",
          564 +                                             fmri.get_dir_path())

    Indentation for line 564 should be four spaces in from the previous
    line as far as I know.
      

That indentation follows the PEP8 standard as far as I can tell. The 
indentation we're using in general is non-standard. I've changed to 
match the existing code where you've noted it, but to be clear, this 
isn't PEP8. Other than the 8 spaces issue, are we moving to PEP8 or 
continuing to use our own standards? (Another place we're inconsistent 
with the standard is that the standard says """ to close a block 
docstring should have its own line and a blank line before it.)

Shawn Walker wrote:
> 2008/7/13 Brock Pytlik <[EMAIL PROTECTED]>:
>   
>> Another webrev for search is up, based on suggestions from Dan, the
>> memory requirements have been curtailed quite a bit. Before indexing,
>> the process seems to top out at around 116M. The indexing then takes
>> that to 133M (instead of the 198M it used to require).
>>
>> The revised webrev is here:
>>
>> http://cr.opensolaris.org/~bpytlik/ips-search/
>>     
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/client.py.wdiff.html
> =========
>      1419 +        quiet = False
>
> This doesn't seem useful given that it will always be False.
>
> Was the intent here to respect or have a verbose option?
>   
All the other similar functions which have a progress tracker have this 
same variable set (see verify, install, image-update, etc...) so I used 
the same idiom used elsewhere in the code.
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/client_query_engine.py.html
> ==========
> Why is this called client_query_engine instead of just query_engine
> given that it's in the client folder?
>
>   
Because I find the naming convention easier to understand and clearer if 
the module includes whether it's server or client (especially when, 
server, client, and neither versions all exist) and I can't see any harm 
in having it there. (Basically, client may be in the import, but if I'm 
working with the code, and I see Catalog, or QueryEngine, I'm heading to 
the modules named that, and when there are two modules named X, I tend 
to go to the general one, which may or not be the one I actually want to 
look at.)
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/image.py.wdiff.html
> ==========.
>
>      1174 +                query = query_e.Query(args[0])
>      1175 +                return qe.search(query)
>
> Why assign to query instead of just pass to qe.search?
>   
For clarity? I could also do "return 
query_e.ClientQueryEngine(self.index_dir).search(query_e.Query(args[0]))"
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/imageplan.py.wdiff.html
> ==========
>
>       525 +                self.progtrack.actions_set_goal("Index
> Phase", len(plan_info))
>
> Can you use a constant for the phase instead of hard-coding it?
>   
I could. I could put a constant INDEX_PHASE = "Index Phase" at the top 
of the file and then use it here. It's not repeated anywhere so there's 
no commonality. Is this really preferable over having the string inline? 
If so I'll change it.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/progress.py.wdiff.html
> ==========
>   
I think I have to ask why for these. I followed the standards in the 
nearby code.
>        73 +                self.ind_phase = "None"
>        74 +                self.ind_phase_last = "None"
>
>   
This I'm not going to change as it's a functional change and I only have 
a vague understanding of the issues involved. It's also not clear to me 
whether the action code and the index code should have the same timing 
or not. So, I'll pull the index one out into a constant, and leave that 
decision for whoever wants to revamp this module overall. As a side 
note. I think what's really needed here is a generic update function, 
rather than specific ones, but that wasn't in the scope for this set of 
changes.
>       577 +                if force or (time.time() -
> self.last_print_time) >= 0.05:
>
> Hrm. I'm not keen on the hard-coded 0.05. This seems like it
> introduces a possible timing issue. Could you use a constant here at
> the least?
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/indexer.py.html
> ==========
> Lines 50, 57, 64 and 67 should probably just start with a '#' to be
> consistent with general style.
>
> While your comments about the file format are helpful, I would prefer
> to see a sample format first followed by a description of the sample.
> That might allow you to remove a good portion of the text by relying
> on the example to explain the format.
>   
# Here is an example of a line from the main dictionary, it is explained 
below:
# %gconf.xml (5,3,65689 => 249,202) (5,3,65690 => 249,202) (5,3,65691 => 
249,202)
# (5,3,65692 => 249,202)
#
# The main dictionary has a more complicated format. Each line begins with a
# search token (%gconf.xml) followed by a list of mappings. Each mapping 
takes
# a token_type, action, and keyvalue tuple ((5,3,65689), (5,3,65690),
# (5,3,65691), (5,3,65692)) to a list of pkg-stem, version pairs 
(249,202) in
# which the token is found in an action with token_type, action, and 
keyvalues
# matching the tuple. Further compaction is gained by storing everything but
# the token as an id which the other dictionaries can turn into 
human-readable
# content.

Here's what I changed it to. Is that what you had in mind?
>  109                 self._max_mem_use =
> float(os.environ.get("PKG_INDEX_MAX_RAM",
>
> You refer to this as max_mem everywhere in the code, so it would make
> more sense for the environment variable to be PKG_INDEX_MAX_MEM as
> well for consistency.
>   
Actually, I think it makes more sense to make the change in the other 
direction. I've changed mem to ram.
>  146                 self._tmp_dir = os.path.join(self._index_dir, "TMP")
>
> Personal preference: why uppercase for the directory name?
>   
Because it makes it stand out more, makes it clear it's a temporary 
directory. I prefer it over tmp.
>  168                             "Reading Existing Index", 
> len(self._data_dict))
>
> I would like to see constants used for goals.
>   
See my above question, why?
>  256                 """ Estimates RAM used based on size of added and
>  257                 removed dictionaries. It returns an estimate in KB.
>  258                 """
>
> Nitpick: s/RAM/memory/ for consistency.
>
> Nitpick: s/estimate in KB/estimated size in KiB/ :-)
>   
Just curious, why KiB? I haven't seen that used.
>  259                 return 0.5892 * dict_size +  -0.12295 * ids + \
>  260                     -0.009595 * total_terms + 23512
>
> It would be nice to see these numbers as constants. On a side note, I
> know that these numbers worked, but "magic values" always bother me...
>   
I'm going to leave these as is, b/c this is really a hack that should 
die. I think leaving them in the code encourages someone to think very 
carefully about what they're doing when they change them instead of 
viewing them as constants to be tweaked. In other words, these ARE magic 
values. To me, they're qualitatively different than the TIMOUT_SECS (for 
example) and promoting them to constants at the top of the file gives 
them a status they shouldn't have. It would also mean, other code could 
potentially get at these numbers, something else I think we want to 
avoid without VERY careful consideration. On a side note, I continue to 
be open to suggestions on how to do this in Python 2.4 better. IIRC, Dan 
mentioned that Python 2.6 (I think) has a way to determine the size of a 
structure.
> Is there a way to tie these numbers to the current version of the
> index format such that they will become invalidated the next time we
> change it? (In other words, force us to either review or update these
> numbers when we change the format.)
>   
I'm open to suggestions. And, actually, it's not dependent on the index 
format. It's dependent on the structures used to hold that information 
in memory, so tying it to the index format (which to me means on disk 
format) isn't the right thing to do.
>  441                 """ Process the main dictionary file and writes out a new
>  442                 main dictionary file reflecting the changes in
> the packages.
>  443                 """
>
> Wording is a bit off. Suggested text:
>
> Processes the main dictionary file and writes out a new
> one reflecting any package changes.
>   
I've gone with:
Processes the main dictionary file and writes out a new
                main dictionary file reflecting the changes in the packages.
as I really prefer "changes in the packages" over "any package changes" 
(which sounds to me like it only applies to changes in which packages 
are installed, not changes to the installed packages)
>  622                                 raise RuntimeError("Got unknown
> input_type: " +
>  623                                     str(input_type))
>
> Use %s instead?
>   
Changed but why? I find %s harder to read than + str(bleh)
>  676                         except Exception, e:
>
> You don't use the e here.
>
>  676                         except Exception, e:
>  677                                 res = False
> ...
>  681                 if res == 0:
>  682                         res = True
>  683                 return res
>
> There's a possible bug here. If an exception happened and res was set
> to False, line 681 will cause it to be set to True. Python considers 0
> and False to be equivalent.
>
> However, you could do this instead which is shorter anyway and avoids
> that possible issue (assuming you didn't intend that):
> return res and True or False
>
>   
Changed to this:
                try:
                        try:
                                res = \
                                    
ss.consistent_open(self._data_dict.values(),
                                        self._index_dir)
                        except Exception:
                                return False
                finally:
                        for d in self._data_dict.values():
                                d.close_file_handle()
                if res == 0:
                        res = True
                return res

>  731                 if self.file_version_number:
>  732                         raise RuntimeError("Got file_version_number 
> other"
>  733                                            "than None in setup.")
>
> This looks like it should be the first thing done before line 720.
>   
Not necessarily. If setup is called and the files are present, the 
behavior would change and that change would break code in other locations.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/query_engine.py.html
> ==========
>
>   88                         }
>
> I think you need to de-indent this by 8 spaces.
>   
When it's not obvious to me, I trust emacs, but it's been changed.
>   91                 """ Searches the indexes in dir_path for any
> matches of query
>   92                 and returns the results. The method assumes the
> dictionaries
>   93                 have already been loaded and read appropriately.
>   94                 """
>
> A comment about the return value would be nice.
>   
There isn't a return value for this function.
>
>  149                 # Construct the answer for the search_0 format
>
> Possible wording:
>
> Construct the answer for version 0 of the search format.
>   
I like search_0 better as it indicates exactly what the relevant methods 
are.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_errors.py.html
> ==========
> I'm not keen on having a separate search errors class. You're also
> missing a CDDL header.
>   
Header's added. I'm VERY much in favor of specific error classes. I 
don't care much which file they live in though. It seemed silly to put 
them in misc. I put them in a separate module because they're sometimes 
imported when the indexer and query_engine modules are not imported. 
Maybe we should have an errors.py or pkg_errors.py which has all the 
cross module exceptions.
> It seems like this class could be merged with another one or we need a
> new subdirectory for search.
>   
I'm fine with putting search into a subdirectory. I'm not sure why this 
tied to having a separate search errors class.
>    1 # __str__ methods defined for subclasses of IndexError should be defined
>    2 # for the server implementations. If the client needs different messages
>    3 # displayed, catch the exception on the client side and display a
> custom message
>
> Missing punctuation. Could change this to """ instead of # I think.
>   
I really think this is a comment, not a docstring for the module.
>    6         def __init__(self, cause):
>    7                 self.cause = cause
>
> "Python in a Nutshell: A Desk Reference" recommends the following
> syntax instead:
>
>        def __init__(self, *args):
>                exceptions.Exception.__init__(self, *args)
>
> In addition, a discussion from last year indicates that the Python
> developers (Gudio, et al.) believe that a recommendation to not rely
> on self.args is appropriate. Instead, explicitly named arguments are
> the preference [1].
>
> In that sense, what you have is good by using "cause", etc. but would
> be better as a named parameter (e.g. cause = None?).
>   
Sorry, you've lost me here. This exception is constructed using exactly 
one argument. Why would I use *args and call the init of the parent 
class? I followed the example given in the python tutorial here: 
http://docs.python.org/tut/node10.html#SECTION0010500000000000000000
>   34                         "permissions. Please correct this issue then " + 
> \
>   35                         "rebuild the index."
>
> Perhaps better worded as:
>
> permissions.  Please correct this issue and then
> restart with --rebuild-index.
>
> ? My intent here is to tell them how to rebuild the index instead of
> just telling them to.
>   
No, because this could happen on either the server or the client side. 
The specific instructions on how to do it are included in the client 
error handling code (and could be added to the server code, but we tend 
to make stronger assumptions that those running a depot know what 
they're doing).
>   37
>
> Remove this line.
>
> In general, why use "+" instead of %s, etc. for the exception messages?
>   
Because I find %s much harder to read and understand than +.

Side note:
I find having to dart back and forth between the middle of a string and 
the end more difficult than simply using +'s to split the string and 
include the variables internally. To use an example from C/C++ world, I 
found the C++ io streams using << much, MUCH easier and intuitive to 
work with than printf/scanf from a user interface perspective (setting 
aside any weirdness that C++ did with << >> which I might be forgetting 
right now and which isn't relevant to the point.) I much prefer having 
my information in left to right order and in place.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_storage.py.html
> ==========
>
>  261         def get_entry_count(self):
>  262                 """ Returns the number of entities stored. """
>  263                 return 0
>
> Maybe this should RaiseError instead of returning 0 to ensure derived
> classes provide this?
>   
I have several derived classes which use this to commonly return 0. If 
you'd prefer I raise an exception here and duplicate the code in a 
couple of places, I will.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/repository.py.wdiff.html
> ==========
>       185 +                        output += ("%s %s %s %s\n" % (l[0],
> l[1], l[2], l[3]))
>
> Why not something like:
>
> output += "%s\n" % ' '.join(l) ?
>
> Is it safe to print all of the elements of l?
>   
I think it's clearer this way. It shows that exactly 4 things are being 
printed (which is what the client side code is expecting). Also, it 
makes clear that the things in l are ordered and that the order is 
important for the client.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_catalog.py.html
> ==========
> Why isn't this just modules/server/catalog.py?
>
>   39
>   
See above comment about client_query_engine
>  132                                                 emsg("CHILD
> DIED:\n" + str(e) +
>  133                                                      "\n\nrepr:" + 
> repr(e))
>
> Why not "CHILD DIED:\n%s\n\nrepr: %s" % (str(e), repr(e))?
>   
See above comments about + vs %s
>  141                                 if not self._search_available:
>  142                                         cherrypy.log("Search Available",
>  143                                             "INDEX")
>  144                                 self._search_available = True
>
> This is a little confusing to me. Can you add a comment explaining why
> you're logging "Search Available" when it isn't and why you
> immediately force it to true either way?
>
>   
It is available. Setup makes it available. We're here if there's nothing 
new to index.
I've added this comment.
                                # Since there is nothing to index, setup
                                # the index and declare search available.
                                # We only log this if this represents
                                # a change in status of the server.
>  197                 if (not self.searchdb_update_handle) or \
>  198                     (self.searchdb_update_handle.poll() == None):
>
> I don't think you need the parentheses.
>   
I find it much clearer with the parens. I'd rather not require a person 
reading the code to carry around the entire order of operations for 
logic functions to accurately understand the code. If you feel strongly, 
I'll use temporary variables and rewrite the if statement, or do 
something else similar.
>  224                             "ERROR building search database, rc:"
> + str(rc)+":"
>
> Why not:
>
> "ERROR building search database, rc: %s" % str(rc)
>   
See above
>  234                 # It appears that there's a roughly factor of 7 blow up
>  235                 # in going from file to internal representation
> of manifests
>  236
>  237                 # storing search_dicts causes a blowup of 3-4
>  238
>
> These seem like random factoids. Can you reword this into a single paragraph?
>   
I'll remove them since they seem random.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_query_engine.py.html
> ==========
> Why isn't this just modules/server/query_engine.py?
>
>   
see above
>
>       188 +                self.cfg.catalog.refresh_index()
>
> I'm somewhat concerned about the effect on a distro import using
> solaris.py or any other bulk pkgsend operations that this is going to
> have.
>
> I think we would need the ability to defer any index rebuild until a
> depot restart or something similar.
>   
I'm open to suggestions on where to put this, but I don't understand the 
issue really. Is your concern that this will slow down a distro import, 
or that it will break a distro import? Why would you want to defer an 
index rebuild when you can asynchronously update the index while waiting 
for more data to come through the pipe?
> http://cr.opensolaris.org/~bpytlik/ips-search/src/tests/cli/t_search.py.html
> ==========
>
>  160                         f.write(p)
>
> Possibly include a newline along with the pathname?
>   
Sure, why?
>  183                 for line in self.example_pkg10.split("\n"):
>
> pkgsend_bulk should do what you want here.
>
>   
huh, didn't know pkgsend_bulk existed
>  197                 # set to 1 because can't currently search on
>
> s/set/Set/
>
> s/because can't/since searches, currently, cannot be performed using/
>   
                # Set to 1 since searches can't currently be performed
                # package name unless it's set inside the
                # manifest which happens at install time on
                # the client sid
>   
Thanks again for taking a look Shawn.
I'll be posting an updated webrev in a little bit once I verify I 
haven't broken anything and synced up with the gate.

Brock
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss

Re: [pkg-discuss] Code Review request: 1949, 983, 2122

Reply via email to