Hi Shawn, thanks taking a look.
If I've snipped a suggestion, that means "ok, I've made that change"
Comments in line.
General comment:
563 + fmri_dir_path = os.path.join(self.imgdir, "pkg",
564 + fmri.get_dir_path())
Indentation for line 564 should be four spaces in from the previous
line as far as I know.
That indentation follows the PEP8 standard as far as I can tell. The
indentation we're using in general is non-standard. I've changed to
match the existing code where you've noted it, but to be clear, this
isn't PEP8. Other than the 8 spaces issue, are we moving to PEP8 or
continuing to use our own standards? (Another place we're inconsistent
with the standard is that the standard says """ to close a block
docstring should have its own line and a blank line before it.)
Shawn Walker wrote:
> 2008/7/13 Brock Pytlik <[EMAIL PROTECTED]>:
>
>> Another webrev for search is up, based on suggestions from Dan, the
>> memory requirements have been curtailed quite a bit. Before indexing,
>> the process seems to top out at around 116M. The indexing then takes
>> that to 133M (instead of the 198M it used to require).
>>
>> The revised webrev is here:
>>
>> http://cr.opensolaris.org/~bpytlik/ips-search/
>>
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/client.py.wdiff.html
> =========
> 1419 + quiet = False
>
> This doesn't seem useful given that it will always be False.
>
> Was the intent here to respect or have a verbose option?
>
All the other similar functions which have a progress tracker have this
same variable set (see verify, install, image-update, etc...) so I used
the same idiom used elsewhere in the code.
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/client_query_engine.py.html
> ==========
> Why is this called client_query_engine instead of just query_engine
> given that it's in the client folder?
>
>
Because I find the naming convention easier to understand and clearer if
the module includes whether it's server or client (especially when,
server, client, and neither versions all exist) and I can't see any harm
in having it there. (Basically, client may be in the import, but if I'm
working with the code, and I see Catalog, or QueryEngine, I'm heading to
the modules named that, and when there are two modules named X, I tend
to go to the general one, which may or not be the one I actually want to
look at.)
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/image.py.wdiff.html
> ==========.
>
> 1174 + query = query_e.Query(args[0])
> 1175 + return qe.search(query)
>
> Why assign to query instead of just pass to qe.search?
>
For clarity? I could also do "return
query_e.ClientQueryEngine(self.index_dir).search(query_e.Query(args[0]))"
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/imageplan.py.wdiff.html
> ==========
>
> 525 + self.progtrack.actions_set_goal("Index
> Phase", len(plan_info))
>
> Can you use a constant for the phase instead of hard-coding it?
>
I could. I could put a constant INDEX_PHASE = "Index Phase" at the top
of the file and then use it here. It's not repeated anywhere so there's
no commonality. Is this really preferable over having the string inline?
If so I'll change it.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/progress.py.wdiff.html
> ==========
>
I think I have to ask why for these. I followed the standards in the
nearby code.
> 73 + self.ind_phase = "None"
> 74 + self.ind_phase_last = "None"
>
>
This I'm not going to change as it's a functional change and I only have
a vague understanding of the issues involved. It's also not clear to me
whether the action code and the index code should have the same timing
or not. So, I'll pull the index one out into a constant, and leave that
decision for whoever wants to revamp this module overall. As a side
note. I think what's really needed here is a generic update function,
rather than specific ones, but that wasn't in the scope for this set of
changes.
> 577 + if force or (time.time() -
> self.last_print_time) >= 0.05:
>
> Hrm. I'm not keen on the hard-coded 0.05. This seems like it
> introduces a possible timing issue. Could you use a constant here at
> the least?
>
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/indexer.py.html
> ==========
> Lines 50, 57, 64 and 67 should probably just start with a '#' to be
> consistent with general style.
>
> While your comments about the file format are helpful, I would prefer
> to see a sample format first followed by a description of the sample.
> That might allow you to remove a good portion of the text by relying
> on the example to explain the format.
>
# Here is an example of a line from the main dictionary, it is explained
below:
# %gconf.xml (5,3,65689 => 249,202) (5,3,65690 => 249,202) (5,3,65691 =>
249,202)
# (5,3,65692 => 249,202)
#
# The main dictionary has a more complicated format. Each line begins with a
# search token (%gconf.xml) followed by a list of mappings. Each mapping
takes
# a token_type, action, and keyvalue tuple ((5,3,65689), (5,3,65690),
# (5,3,65691), (5,3,65692)) to a list of pkg-stem, version pairs
(249,202) in
# which the token is found in an action with token_type, action, and
keyvalues
# matching the tuple. Further compaction is gained by storing everything but
# the token as an id which the other dictionaries can turn into
human-readable
# content.
Here's what I changed it to. Is that what you had in mind?
> 109 self._max_mem_use =
> float(os.environ.get("PKG_INDEX_MAX_RAM",
>
> You refer to this as max_mem everywhere in the code, so it would make
> more sense for the environment variable to be PKG_INDEX_MAX_MEM as
> well for consistency.
>
Actually, I think it makes more sense to make the change in the other
direction. I've changed mem to ram.
> 146 self._tmp_dir = os.path.join(self._index_dir, "TMP")
>
> Personal preference: why uppercase for the directory name?
>
Because it makes it stand out more, makes it clear it's a temporary
directory. I prefer it over tmp.
> 168 "Reading Existing Index",
> len(self._data_dict))
>
> I would like to see constants used for goals.
>
See my above question, why?
> 256 """ Estimates RAM used based on size of added and
> 257 removed dictionaries. It returns an estimate in KB.
> 258 """
>
> Nitpick: s/RAM/memory/ for consistency.
>
> Nitpick: s/estimate in KB/estimated size in KiB/ :-)
>
Just curious, why KiB? I haven't seen that used.
> 259 return 0.5892 * dict_size + -0.12295 * ids + \
> 260 -0.009595 * total_terms + 23512
>
> It would be nice to see these numbers as constants. On a side note, I
> know that these numbers worked, but "magic values" always bother me...
>
I'm going to leave these as is, b/c this is really a hack that should
die. I think leaving them in the code encourages someone to think very
carefully about what they're doing when they change them instead of
viewing them as constants to be tweaked. In other words, these ARE magic
values. To me, they're qualitatively different than the TIMOUT_SECS (for
example) and promoting them to constants at the top of the file gives
them a status they shouldn't have. It would also mean, other code could
potentially get at these numbers, something else I think we want to
avoid without VERY careful consideration. On a side note, I continue to
be open to suggestions on how to do this in Python 2.4 better. IIRC, Dan
mentioned that Python 2.6 (I think) has a way to determine the size of a
structure.
> Is there a way to tie these numbers to the current version of the
> index format such that they will become invalidated the next time we
> change it? (In other words, force us to either review or update these
> numbers when we change the format.)
>
I'm open to suggestions. And, actually, it's not dependent on the index
format. It's dependent on the structures used to hold that information
in memory, so tying it to the index format (which to me means on disk
format) isn't the right thing to do.
> 441 """ Process the main dictionary file and writes out a new
> 442 main dictionary file reflecting the changes in
> the packages.
> 443 """
>
> Wording is a bit off. Suggested text:
>
> Processes the main dictionary file and writes out a new
> one reflecting any package changes.
>
I've gone with:
Processes the main dictionary file and writes out a new
main dictionary file reflecting the changes in the packages.
as I really prefer "changes in the packages" over "any package changes"
(which sounds to me like it only applies to changes in which packages
are installed, not changes to the installed packages)
> 622 raise RuntimeError("Got unknown
> input_type: " +
> 623 str(input_type))
>
> Use %s instead?
>
Changed but why? I find %s harder to read than + str(bleh)
> 676 except Exception, e:
>
> You don't use the e here.
>
> 676 except Exception, e:
> 677 res = False
> ...
> 681 if res == 0:
> 682 res = True
> 683 return res
>
> There's a possible bug here. If an exception happened and res was set
> to False, line 681 will cause it to be set to True. Python considers 0
> and False to be equivalent.
>
> However, you could do this instead which is shorter anyway and avoids
> that possible issue (assuming you didn't intend that):
> return res and True or False
>
>
Changed to this:
try:
try:
res = \
ss.consistent_open(self._data_dict.values(),
self._index_dir)
except Exception:
return False
finally:
for d in self._data_dict.values():
d.close_file_handle()
if res == 0:
res = True
return res
> 731 if self.file_version_number:
> 732 raise RuntimeError("Got file_version_number
> other"
> 733 "than None in setup.")
>
> This looks like it should be the first thing done before line 720.
>
Not necessarily. If setup is called and the files are present, the
behavior would change and that change would break code in other locations.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/query_engine.py.html
> ==========
>
> 88 }
>
> I think you need to de-indent this by 8 spaces.
>
When it's not obvious to me, I trust emacs, but it's been changed.
> 91 """ Searches the indexes in dir_path for any
> matches of query
> 92 and returns the results. The method assumes the
> dictionaries
> 93 have already been loaded and read appropriately.
> 94 """
>
> A comment about the return value would be nice.
>
There isn't a return value for this function.
>
> 149 # Construct the answer for the search_0 format
>
> Possible wording:
>
> Construct the answer for version 0 of the search format.
>
I like search_0 better as it indicates exactly what the relevant methods
are.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_errors.py.html
> ==========
> I'm not keen on having a separate search errors class. You're also
> missing a CDDL header.
>
Header's added. I'm VERY much in favor of specific error classes. I
don't care much which file they live in though. It seemed silly to put
them in misc. I put them in a separate module because they're sometimes
imported when the indexer and query_engine modules are not imported.
Maybe we should have an errors.py or pkg_errors.py which has all the
cross module exceptions.
> It seems like this class could be merged with another one or we need a
> new subdirectory for search.
>
I'm fine with putting search into a subdirectory. I'm not sure why this
tied to having a separate search errors class.
> 1 # __str__ methods defined for subclasses of IndexError should be defined
> 2 # for the server implementations. If the client needs different messages
> 3 # displayed, catch the exception on the client side and display a
> custom message
>
> Missing punctuation. Could change this to """ instead of # I think.
>
I really think this is a comment, not a docstring for the module.
> 6 def __init__(self, cause):
> 7 self.cause = cause
>
> "Python in a Nutshell: A Desk Reference" recommends the following
> syntax instead:
>
> def __init__(self, *args):
> exceptions.Exception.__init__(self, *args)
>
> In addition, a discussion from last year indicates that the Python
> developers (Gudio, et al.) believe that a recommendation to not rely
> on self.args is appropriate. Instead, explicitly named arguments are
> the preference [1].
>
> In that sense, what you have is good by using "cause", etc. but would
> be better as a named parameter (e.g. cause = None?).
>
Sorry, you've lost me here. This exception is constructed using exactly
one argument. Why would I use *args and call the init of the parent
class? I followed the example given in the python tutorial here:
http://docs.python.org/tut/node10.html#SECTION0010500000000000000000
> 34 "permissions. Please correct this issue then " +
> \
> 35 "rebuild the index."
>
> Perhaps better worded as:
>
> permissions. Please correct this issue and then
> restart with --rebuild-index.
>
> ? My intent here is to tell them how to rebuild the index instead of
> just telling them to.
>
No, because this could happen on either the server or the client side.
The specific instructions on how to do it are included in the client
error handling code (and could be added to the server code, but we tend
to make stronger assumptions that those running a depot know what
they're doing).
> 37
>
> Remove this line.
>
> In general, why use "+" instead of %s, etc. for the exception messages?
>
Because I find %s much harder to read and understand than +.
Side note:
I find having to dart back and forth between the middle of a string and
the end more difficult than simply using +'s to split the string and
include the variables internally. To use an example from C/C++ world, I
found the C++ io streams using << much, MUCH easier and intuitive to
work with than printf/scanf from a user interface perspective (setting
aside any weirdness that C++ did with << >> which I might be forgetting
right now and which isn't relevant to the point.) I much prefer having
my information in left to right order and in place.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_storage.py.html
> ==========
>
> 261 def get_entry_count(self):
> 262 """ Returns the number of entities stored. """
> 263 return 0
>
> Maybe this should RaiseError instead of returning 0 to ensure derived
> classes provide this?
>
I have several derived classes which use this to commonly return 0. If
you'd prefer I raise an exception here and duplicate the code in a
couple of places, I will.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/repository.py.wdiff.html
> ==========
> 185 + output += ("%s %s %s %s\n" % (l[0],
> l[1], l[2], l[3]))
>
> Why not something like:
>
> output += "%s\n" % ' '.join(l) ?
>
> Is it safe to print all of the elements of l?
>
I think it's clearer this way. It shows that exactly 4 things are being
printed (which is what the client side code is expecting). Also, it
makes clear that the things in l are ordered and that the order is
important for the client.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_catalog.py.html
> ==========
> Why isn't this just modules/server/catalog.py?
>
> 39
>
See above comment about client_query_engine
> 132 emsg("CHILD
> DIED:\n" + str(e) +
> 133 "\n\nrepr:" +
> repr(e))
>
> Why not "CHILD DIED:\n%s\n\nrepr: %s" % (str(e), repr(e))?
>
See above comments about + vs %s
> 141 if not self._search_available:
> 142 cherrypy.log("Search Available",
> 143 "INDEX")
> 144 self._search_available = True
>
> This is a little confusing to me. Can you add a comment explaining why
> you're logging "Search Available" when it isn't and why you
> immediately force it to true either way?
>
>
It is available. Setup makes it available. We're here if there's nothing
new to index.
I've added this comment.
# Since there is nothing to index, setup
# the index and declare search available.
# We only log this if this represents
# a change in status of the server.
> 197 if (not self.searchdb_update_handle) or \
> 198 (self.searchdb_update_handle.poll() == None):
>
> I don't think you need the parentheses.
>
I find it much clearer with the parens. I'd rather not require a person
reading the code to carry around the entire order of operations for
logic functions to accurately understand the code. If you feel strongly,
I'll use temporary variables and rewrite the if statement, or do
something else similar.
> 224 "ERROR building search database, rc:"
> + str(rc)+":"
>
> Why not:
>
> "ERROR building search database, rc: %s" % str(rc)
>
See above
> 234 # It appears that there's a roughly factor of 7 blow up
> 235 # in going from file to internal representation
> of manifests
> 236
> 237 # storing search_dicts causes a blowup of 3-4
> 238
>
> These seem like random factoids. Can you reword this into a single paragraph?
>
I'll remove them since they seem random.
> http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_query_engine.py.html
> ==========
> Why isn't this just modules/server/query_engine.py?
>
>
see above
>
> 188 + self.cfg.catalog.refresh_index()
>
> I'm somewhat concerned about the effect on a distro import using
> solaris.py or any other bulk pkgsend operations that this is going to
> have.
>
> I think we would need the ability to defer any index rebuild until a
> depot restart or something similar.
>
I'm open to suggestions on where to put this, but I don't understand the
issue really. Is your concern that this will slow down a distro import,
or that it will break a distro import? Why would you want to defer an
index rebuild when you can asynchronously update the index while waiting
for more data to come through the pipe?
> http://cr.opensolaris.org/~bpytlik/ips-search/src/tests/cli/t_search.py.html
> ==========
>
> 160 f.write(p)
>
> Possibly include a newline along with the pathname?
>
Sure, why?
> 183 for line in self.example_pkg10.split("\n"):
>
> pkgsend_bulk should do what you want here.
>
>
huh, didn't know pkgsend_bulk existed
> 197 # set to 1 because can't currently search on
>
> s/set/Set/
>
> s/because can't/since searches, currently, cannot be performed using/
>
# Set to 1 since searches can't currently be performed
# package name unless it's set inside the
# manifest which happens at install time on
# the client sid
>
Thanks again for taking a look Shawn.
I'll be posting an updated webrev in a little bit once I verify I
haven't broken anything and synced up with the gate.
Brock
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss