2008/7/13 Brock Pytlik <[EMAIL PROTECTED]>: > Another webrev for search is up, based on suggestions from Dan, the > memory requirements have been curtailed quite a bit. Before indexing, > the process seems to top out at around 116M. The indexing then takes > that to 133M (instead of the 198M it used to require). > > The revised webrev is here: > > http://cr.opensolaris.org/~bpytlik/ips-search/
http://cr.opensolaris.org/~bpytlik/ips-search/src/client.py.wdiff.html ========= 126 + "of pfexec when running pkg commands is often a source of this problem.)" Two spaces after pfexec. 128 +def get_partial_indexing_error_message(text): 129 + tmp = "Result of partial indexing found " + \ 130 + "Could not make: " + \ 131 + text + " because it already exists. " + \ 132 + "Please use pkg rebuild-index " + \ 133 + "to fix this problem." 134 + return tmp No reason to assign a variable here and then return it. Maybe something like this instead? return "Partial search index found. " + \ "Could not create: %s because it already " + \ "exists. Please use pkg rebuild-index to " + \ "fix this problem." % text 671 + "index") Missing punctuation. 674 + error("The search index appears corrupted. Please " 675 + "rebuild the index with pkg rebuild-index.") Should have two spaces before "Please". 1418 + """ Need to move this up to the end of 1417. 1419 + quiet = False This doesn't seem useful given that it will always be False. Was the intent here to respect or have a verbose option? 1551 + tmp = \ It would be better not to assign this to a variable. Just pass it as a argument. http://cr.opensolaris.org/~bpytlik/ips-search/src/depot.py.wdiff.html ========== 154 + # This flag is purposefully ommited in usage. s/ommited/omitted/ 155 + # The supported way to forcefully reindex is to 156 + # kill any pkg.depot using that directory. 157 + # Remove the index directory, and restart the 158 + # pkg.depot process. The index will be rebuilt 159 + # automatically. The wording is a bit choppy here. How about this? The supported way to forcefully reindex is to kill any pkg.depot using that directory, remove the index directory, and restart the pkg.depot process. The index will be rebuilt automatically on startup. http://cr.opensolaris.org/~bpytlik/ips-search/src/man/pkg.1.txt.wdiff.html ========= 228 + rebuild-index 229 + Rebuilds the index used by 'pkg search'. This is a recovery operation 230 + not intended for general use. 231 + It seems odd to document this at all and not document the option at the beginning of the text. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/actions/attribute.py.wdiff.html ========== 60 + """ since there's no install method, this class is always 61 + installed correctly 62 + """ Better as the following possibly: """Since there's no install method, this class is always installed correctly.""" 66 + """ generate the indices needed by the search dictionary. """ Probably needs to be: """Generates the indices needed by the search dictionary.""" 67 + if ((self.attrs["name"] == "description") or 68 + (' ' in self.attrs["value"])): I don't think you need the extra parentheses around the conditions. 76 + mo = self.internal_re.match(fmri) You have two spaces after the '='. 80 + self.attrs["name"]:[mo.group('pkgname'), Missing space after ':'. 97 + self.attrs["value"]:self.attrs["value"] Missing space after ':'. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/catalog.py.wdiff.html ========== 107 + self.catalog_root,"catalog")) Missing space after ','. 697 395 yield fmri.PkgFmri("[EMAIL PROTECTED]" % 396 + (cat_name, cat_version), 397 + authority = self.auth) You seem to have added a lot of indentation here for 396-397. 711 409 yield fmri.PkgFmri(cat_fmri, 712 - authority = self.auth) 410 + authority = self.auth) You seem to have added a lot of indentation here for 410. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/client_query_engine.py.html ========== Why is this called client_query_engine instead of just query_engine given that it's in the client folder? 1 #!/usr/bin/python Please ensure you always specify #!/usr/bin/python#min_version#, in this case, #!/usr/bin/python2.4. 29 TIMEOUT = 5 It would be nice if the variable indicate the unit of measure. TIMEOUT_SECS, etc.? 38 """ http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/image.py.wdiff.html ========== 455 + """ produces fmri, paths to the maifest for that fmri pairs """ Wording is bit odd here, plus manifest is misspelled. Can you clarify? 563 + fmri_dir_path = os.path.join(self.imgdir, "pkg", 564 + fmri.get_dir_path()) Indentation for line 564 should be four spaces in from the previous line as far as I know. 922 + authority = authpfx) Remove the spaces around '=' for PEP8 style. 1048 1066 self.type != IMG_USER) Four space continuation indent. 1063 1081 self.type != IMG_USER) Four space continuation indent. 1133 + counthash = counthash) Remove spaces around '=' for PEP8 style. 1161 + def update_index_dir(self, postfix = "index"): Remove spaces around '=' for PEP8 style. 1162 + """ Since it's not reliable that the index directory will be 1163 + updated when the image root is updated. This should be called 1164 + prior to using the index directory. 1165 + """ Drop the space before "Since" on line 1162 to be consistent with the comment style in the rest of this file. Two spaces instead of one before "This" on line 1163 to match style of rest of file. Move line 1165 up to end of line 1164. 1174 + query = query_e.Query(args[0]) 1175 + return qe.search(query) Why assign to query instead of just pass to qe.search? 1357 + """ Forcefully rebuilds the search indexes. Removes all existing 1358 + indexes and replaces them from scratch rather than performing 1359 + the incremental update which is usually used. 1360 + """ Same as earlier. Dump the space at the beginning of the comment. Two spaces between sentences. End of comment markers should go after the last sentence instead of on their own line. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/imageplan.py.wdiff.html ========== 397 + force_rebuild = False) Dump the spaces around '=' for PEP8. 525 + self.progtrack.actions_set_goal("Index Phase", len(plan_info)) Can you use a constant for the phase instead of hard-coding it? http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/client/progress.py.wdiff.html ========== 73 + self.ind_phase = "None" 74 + self.ind_phase_last = "None" Can you use a constant for the phase value? 576 + def ind_output(self, force = False): Drop the spaces around '='. 577 + if force or (time.time() - self.last_print_time) >= 0.05: Hrm. I'm not keen on the hard-coded 0.05. This seems like it introduces a possible timing issue. Could you use a constant here at the least? http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/indexer.py.html ========== Lines 50, 57, 64 and 67 should probably just start with a '#' to be consistent with general style. While your comments about the file format are helpful, I would prefer to see a sample format first followed by a description of the sample. That might allow you to remove a good portion of the text by relying on the example to explain the format. 97 # used as arguments Missing punctuation. 98 PKGS = True 99 FMRIS = False The names and values of these constants could be clearer as to their purpose. Assigning them true and false values implies that they can be evaluated as booleans, which I don't think is the intent. I would suggest: IDX_INPUT_TYPE_PKG = 0 IDX_INPUT_TYPE_FMRI = 1 109 self._max_mem_use = float(os.environ.get("PKG_INDEX_MAX_RAM", You refer to this as max_mem everywhere in the code, so it would make more sense for the environment variable to be PKG_INDEX_MAX_MEM as well for consistency. 146 self._tmp_dir = os.path.join(self._index_dir, "TMP") Personal preference: why uppercase for the directory name? 157 """ Private method for building versions from a string """ Missing punctuation. 161 """ Opens all index files consistently and reads all but 162 the main dictionary file (as storing that one in memory 163 isn't efficient. 164 """ You are missing a closing parentheses and you might want to consider the following wording for clarity: Opens all index files using consistent_open and reads all of them into memory except the main dictionary file to avoid inefficient memory usage. 168 "Reading Existing Index", len(self._data_dict)) I would like to see constants used for goals. 198 """ Adds the terms in new_dict to the dictionary as pointers to 199 added_fmri. 200 """ Should this be s/the dictionary/added_dict/ ? It would also be nice if it indicated what it returns. 203 # dict -> dict -> set. This arrangement wasted an enourmous s/enourmous/enormous/ 205 # dictionarys and third level sets. That structure was replaced s/dictionarys/dictionaries/ 220 # list being written to. Two spaces before list instead of one. Can you break up the comment on lines 202-220 into a few more paragraphs? 256 """ Estimates RAM used based on size of added and 257 removed dictionaries. It returns an estimate in KB. 258 """ Nitpick: s/RAM/memory/ for consistency. Nitpick: s/estimate in KB/estimated size in KiB/ :-) 259 return 0.5892 * dict_size + -0.12295 * ids + \ 260 -0.009595 * total_terms + 23512 It would be nice to see these numbers as constants. On a side note, I know that these numbers worked, but "magic values" always bother me... Is there a way to tie these numbers to the current version of the index format such that they will become invalidated the next time we change it? (In other words, force us to either review or update these numbers when we change the format.) 294 # The pkg plan for a newly package added has an origin s/newly package added/newly added package/ ? 413 remove_action_ids, remove_keyval_ids, remove_tok_type_ids, 414 remove_fmri_ids, remove_version_ids): Should be indented four spaces from line 412. 416 token offsets to _data_token_offset Missing punctuation. 441 """ Process the main dictionary file and writes out a new 442 main dictionary file reflecting the changes in the packages. 443 """ Wording is a bit off. Suggested text: Processes the main dictionary file and writes out a new one reflecting any package changes. 453 if self.file_version_number == None: A newline before here would be nice. 456 self.file_version_number = self.file_version_number + 1 Why not += 1? 460 # It's opened in append mode to avoid blowing away the version s/It's/The dictionary file is/ 461 # information the search storage class added Missing punctuation. 509 # information for that token Missing punctuation. 574 dictionary into human readable information """ Missing punctuation. 583 """ Performs all the steps needed to update the indexes Missing punctuation. 596 # case, throw an exception and quit. Is the "and quit" actually happening here? Isn't it sufficient to say that an exception was raised? 622 raise RuntimeError("Got unknown input_type: " + 623 str(input_type)) Use %s instead? 634 if self._progtrack is not None: 635 self._progtrack.index_done() Is it really done when we haven't written out the companion dictionaries or finished moving files yet? It seems like this should be the very last thing we do. I know that it's technically "done building the index" ... yet, it still seems odd. 651 accidentally blowing away files. Personal preference: s/blowing away/removing/ 660 takes a list of fmri's to be added to the repot. Currently, the s/fmri's/fmris/ or s/fmri's/FMRIs/ 662 from the depot, and reindex. Note: if tmp_index_dir is Nitpick: s/, and/ and/ 664 This prevents the indexer from accidentally blowing away files. Personal preference: s/blowing away/removing/ 670 """ Check to see if a consistent index exists """ Suggested wording: Returns a boolean value indicating whether a consistent index exists. 676 except Exception, e: You don't use the e here. 676 except Exception, e: 677 res = False ... 681 if res == 0: 682 res = True 683 return res There's a possible bug here. If an exception happened and res was set to False, line 681 will cause it to be set to True. Python considers 0 and False to be equivalent. However, you could do this instead which is shorter anyway and avoids that possible issue (assuming you didn't intend that): return res and True or False 727 if absent and present: 728 raise search_errors.InconsistentIndex(self._index_dir) Seems like this could be moved up inside the for loop started on line 720 as an early out condition. 731 if self.file_version_number: 732 raise RuntimeError("Got file_version_number other" 733 "than None in setup.") This looks like it should be the first thing done before line 720. 757 return fmri_set Given that python is pass-by-reference, is this necessary? If so, update the comment to document the return value on line 742. 761 permanent one""" s/"""/. """/ http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/manifest.py.wdiff.html ========== 285 283 """ Store the manifest contents and the pickled index 286 284 of the manifest (used for searching) to disk. 287 285 """ Need to update the comment and remove the pickle references. Can you also move 284 up to 283? http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/misc.py.wdiff.html ========== 331 + SERVER_DEFAULT_MEM_USE = (phys_pages / 1024.0) * page_size / 3 332 + CLIENT_DEFAULT_MEM_USE = SERVER_DEFAULT_MEM_USE / 2.0 333 + 334 +except: 335 + CLIENT_DEFAULT_MEM_USE = 100 336 + SERVER_DEFAULT_MEM_USE = 500 It would be nice if these variables included the unit of measure in their name. CLIENT_DEFAULT_MEM_USE_KIB for example. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/query_engine.py.html ========== 1 #!/usr/bin/python Please ensure you always specify #!/usr/bin/python#min_version#, in this case, #!/usr/bin/python2.4. 46 class QueryEngine: Add a newline before this if you would. You're missing the comment you have for this in indexer.py on lines 112-117. 81 'fmri' : set(), 82 'action' : set(), 83 'tok_type' : set(), 84 'version' : set(), 85 'keyval' : set(), 86 'main_dict' : set(), 87 'token_byte_offset' : set() I believe these lines should be indented with four spaces and there should be no space before each ':'. 88 } I think you need to de-indent this by 8 spaces. 91 """ Searches the indexes in dir_path for any matches of query 92 and returns the results. The method assumes the dictionaries 93 have already been loaded and read appropriately. 94 """ A comment about the return value would be nice. 146 results of the search. """ You need to move the """ down to the next line to be consistent with the rest of the file. In general, we don't seem to have a consistent style for comment end markers. 149 # Construct the answer for the search_0 format Possible wording: Construct the answer for version 0 of the search format. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_errors.py.html ========== I'm not keen on having a separate search errors class. You're also missing a CDDL header. It seems like this class could be merged with another one or we need a new subdirectory for search. Other general comment: line continuation indent is four spaces. 1 # __str__ methods defined for subclasses of IndexError should be defined 2 # for the server implementations. If the client needs different messages 3 # displayed, catch the exception on the client side and display a custom message Missing punctuation. Could change this to """ instead of # I think. 6 def __init__(self, cause): 7 self.cause = cause "Python in a Nutshell: A Desk Reference" recommends the following syntax instead: def __init__(self, *args): exceptions.Exception.__init__(self, *args) In addition, a discussion from last year indicates that the Python developers (Gudio, et al.) believe that a recommendation to not rely on self.args is appropriate. Instead, explicitly named arguments are the preference [1]. In that sense, what you have is good by using "cause", etc. but would be better as a named parameter (e.g. cause = None?). 34 "permissions. Please correct this issue then " + \ 35 "rebuild the index." Perhaps better worded as: permissions. Please correct this issue and then restart with --rebuild-index. ? My intent here is to tell them how to rebuild the index instead of just telling them to. 37 Remove this line. In general, why use "+" instead of %s, etc. for the exception messages? http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/search_storage.py.html ========== 40 It retries several times in case it has bad luck with hitting 41 the files in the middle of a migration. Perhaps better worded as: It retries several times in case a race condition between file migration and open is encountered. 60 # if it opens, and it's the first file, tried s/if/If/ 82 # all files and try again Missing punctuation. 104 # The index is empty (ie, no files were present) Missing punctuation. 150 Note: This funquction should be considered protected even s/funquction/function/ 152 in python. In otherwords, this function should only be called s/otherwords/other words/ 165 to (heurisitcally) determine whether the file backing this s/heurisitcally/heuristically/ 205 to or from this file_handle may result unexpected s/result/result in/ 212 """ Parses one line of a main dictionary file s/file/file./ 226 fmri_ids[len(fmri_ids)-1] = \ 227 fmri_ids[len(fmri_ids)-1].strip(')') Missing a space before and after '-'. 261 def get_entry_count(self): 262 """ Returns the number of entities stored. """ 263 return 0 Maybe this should RaiseError instead of returning 0 to ensure derived classes provide this? 268 use_dir. If it has done this previously, it blows away the old Personal preference: s/blows away/removes/. 294 def __init__(self, file_name, build_function = None): Remove spaces around '=' for PEP8. 406 if (line == '\n'): Drop the parentheses. 585 """ Reads the file and removes all frmi's in the file s/frmi's/fmris/ 591 def get_entry_count(self): Add a newline before this. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/config.py.wdiff.html ========== 131 + def acquire_catalog(self, rebuild = True): Drop spaces around '=' for PEP8. 135 + self.catalog = catalog.ServerCatalog(self.cat_root, 136 + pkg_root = self.pkg_root, read_only = self.read_only, 137 + index_root = self.index_root, repo_root = self.repo_root, 138 + rebuild = rebuild) Drop spaces around '=' for PEP8. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/repository.py.wdiff.html ========== 185 + output += ("%s %s %s %s\n" % (l[0], l[1], l[2], l[3])) Why not something like: output += "%s\n" % ' '.join(l) ? Is it safe to print all of the elements of l? http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_catalog.py.html ========== Why isn't this just modules/server/catalog.py? 39 A class comment here would be nice. 82 """ This function refreshes the search indexes if any new s/if any/if there are any/ 84 run_update_index (see below) which does the actual update Missing punctuation. 132 emsg("CHILD DIED:\n" + str(e) + 133 "\n\nrepr:" + repr(e)) Why not "CHILD DIED:\n%s\n\nrepr: %s" % (str(e), repr(e))? 141 if not self._search_available: 142 cherrypy.log("Search Available", 143 "INDEX") 144 self._search_available = True This is a little confusing to me. Can you add a comment explaining why you're logging "Search Available" when it isn't and why you immediately force it to true either way? 155 one to through an exception. s/through/raise/ 172 Remove this line. 173 def build_catalog(self): Function comment would be nice. 185 """Handler method for the SIGCLD signal. Checks to see if the 186 search database update child has finished, and enables searching 187 if it finished successfully, or logs an error if it didn't.""" Your other comments have a space after """ at the beginning and before """ at the end. Change this one for consistency. 192 "while in child handler, this shoudn't happen") s/shouldn't/shouldn't/ 193 # If there's not update_handle, then another subprocess was s/not/no/ 197 if (not self.searchdb_update_handle) or \ 198 (self.searchdb_update_handle.poll() == None): I don't think you need the parentheses. 204 cherrypy.log("Search indicies updated and available", s/indicies/indices/ Missing punctuation. 224 "ERROR building search database, rc:" + str(rc)+":" Why not: "ERROR building search database, rc: %s" % str(rc) 227 """ Takes a fmri_list and calls the indexer with list of fmri s/list/a list/ 234 # It appears that there's a roughly factor of 7 blow up 235 # in going from file to internal representation of manifests 236 237 # storing search_dicts causes a blowup of 3-4 238 These seem like random factoids. Can you reword this into a single paragraph? 239 # Rather than storing those, simply pass along the 240 # file simply pass along the file path and have the "simply pass" is stated twice in succession. http://cr.opensolaris.org/~bpytlik/ips-search/src/modules/server/server_query_engine.py.html ========== Why isn't this just modules/server/query_engine.py? 1 #!/usr/bin/python Please ensure you always specify #!/usr/bin/python#min_version#, in this case, #!/usr/bin/python2.4. 31 TIMEOUT = 5 A more descriptive name for this variable would be nice. 66 and returns the results Missing punctuation. 188 + self.cfg.catalog.refresh_index() I'm somewhat concerned about the effect on a distro import using solaris.py or any other bulk pkgsend operations that this is going to have. I think we would need the ability to defer any index rebuild until a depot restart or something similar. http://cr.opensolaris.org/~bpytlik/ips-search/src/tests/cli/t_search.py.html ========== There are also an inconsistent number of newlines between the functions in this class. 1 #!/usr/bin/python Please ensure you always specify #!/usr/bin/python#min_version#, in this case, #!/usr/bin/python2.4. 158 # write the name of the file into the file, so that s/write/Write/ 159 # all files have differing contents Missing punctuation. 160 f.write(p) Possibly include a newline along with the pathname? 177 print "Correct Answer : " + str(correct_answer) 178 print "Proposed Answer: " + str(proposed_answer) 179 assert correct_answer == proposed_answer Personal preference: reverse the order these are printed in. 183 for line in self.example_pkg10.split("\n"): pkgsend_bulk should do what you want here. 197 # set to 1 because can't currently search on s/set/Set/ s/because can't/since searches, currently, cannot be performed using/ 200 # the client side Missing punctuation. 201 self.pkg("search -r example_pkg", exit = 1) Drop spaces around '='. 206 res_list = (open(outfile, 'rb')).readlines() ... 210 res_list = (open(outfile, 'rb')).readlines() ... 214 res_list = (open(outfile, 'rb')).readlines() ... 218 res_list = (open(outfile, 'rb')).readlines() ... 222 res_list = (open(outfile, 'rb')).readlines() ... 226 res_list = (open(outfile, 'rb')).readlines() ... 230 res_list = (open(outfile, 'rb')).readlines() ... 234 res_list = (open(outfile, 'rb')).readlines() You use single quotes around the open mode here, but in many other places use double quotes. The code in lines 205-235 can probably be condensed to something like this: def test_remote_search_op(desc, outfile, test_value): self.pkg("search -r %s > %s" % (desc, outfile)) res_list = (open(outfile, "rb")).readlines() self._check(set(res_list), test_value) test_remote_search_op("example_path", outfile, self.res_remote_path) test_remote_search_op("example*", outfile, self.res_remote_wildcard) ...and so on. 292 self.pkg("search a_non_existant_token", exit = 1) 293 self.pkg("search a_non_existant_token", exit = 1) 305 self.pkg("search example_path", exit = 1) Drop spaces around '=' for PEP8. Lines 301-317 could possibly be consolidated just as described above for lines 205-235. 302 res_list = (open(outfile, 'rb')).readlines() ... 308 res_list = (open(outfile, 'rb')).readlines() ... 312 res_list = (open(outfile, 'rb')).readlines() ... 316 res_list = (open(outfile, 'rb')).readlines() You use single quotes around the open mode here, but in many other places use double quotes. There are two spaces after the word "search" in each of the tests here as well. 321 self.pkg("search example_pkg", exit = 1) 322 self.pkg("search example_path", exit = 1) 323 self.pkg("search example*", exit = 1) 324 self.pkg("search /bin", exit = 1) Drop spaces around '=' for PEP8. 350 for line in self.example_pkg10.split("\n"): 351 line = line.strip() 352 if line == "": 353 continue 354 355 try: 356 self.pkgsend(durl, line) 357 except: 358 self.pkgsend(durl, "close -A") 359 raise This could seems to be repeated quite frequently in this test unit. If pkgsend_bulk doesn't work for you, consider consolidating this into a routine shared among the tests. 371 372 self.pkg("install example_pkg") It would be good to add a comment on 371 to make it clear that you are intentionally repeating the tests. In addition, since you are executing the same code twice, why not just put it into a for loop that executes twice? ========== Thanks for making these changes. I'm looking forward to improved searching. Cheers, -- Shawn Walker [1] http://mail.python.org/pipermail/python-3000/2007-July/008871.html _______________________________________________ pkg-discuss mailing list [email protected] http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
