I spent more time on this problem, and I don't see an easy way out of it.
I can define the problem as this: when I view the page of a given package, I want to see its main metadata, such as pkgmap, and the output of the dump utility to see which binaries depend on which sonames. For example:

http://buildfarm.opencsw.org/pkgdb/srv4/1bd915f6cdbf1217addd0e6d28823dff/

This page is currently forced to display the following depressing message:

"As of January 2013, the stats stored are so big that processing them can take several minutes before they can be served. Disabling until a proper solution is in place."

If you try to re-enable the display of metadata, you just get a timeout.

The current state of affairs means that it's much harder to review and diagnose packages than it needs to be. Things that used to take 10 seconds to check now take 5 to 10 minutes and require shell access: you can't review packages using just a web browser. I'm unhappy about this.

An obvious solution is to keep the elfdump and ldd data in a separate place, and not include it with the main bulk of metadata. It sounds easier than it is.

The core problem is this: we are no longer capable of keeping a single package's metadata in RAM to analyze it. We might be on the buildfarm, but our longer-term plan is to allow other people with smaller hardware to run their own buildfarm. I'm using a virtual machine with 1.6GB of RAM as a reference. I am not able to index our catalogs on our machine; it just fails because of insufficient RAM.
Our package checking code has a nice and simple API: you define a function, which gets your package's metadata and a few interaction objects it uses to report errors:

https://sourceforge.net/apps/trac/gar/browser/csw/mgar/gar/v2/lib/python/package_checks.py

Here's an example check that verifies that a package does not depend on itself:

    def CheckDependsOnSelf(pkg_data, error_mgr, logger, messenger):
      pkgname = pkg_data["basic_stats"]["pkgname"]
      for depname, dep_desc in pkg_data["depends"]:
        if depname == pkgname:
          error_mgr.ReportError("depends-on-self")

It is a really simple API, because pkg_data is just a data structure deserialized from JSON, the same one you can get via REST, which is a simple HTTP GET you can do with curl or anything else that can talk over HTTP. There are no mysteries and no lazy evaluation. You just see what the data structure is, and you can traverse it for whatever data you need.

With the current amount of data, we cannot have a simple pkg_data any more. We'll have to switch to something doing lazy evaluation, and we run the risk that a check function can leak memory and cause checkpkg to crash. Of course, we can implement this, but I can already hear people saying "this is so complicated, why are you preventing me from writing checks?"

The new API would have to look something like this:

    def CheckDependsOnSelf(data_access_object, error_mgr, logger, messenger):
      pkgname = data_access_object.get("basic_stats")["pkgname"]
      for depname, dep_desc in data_access_object.get("depends"):
        if depname == pkgname:
          error_mgr.ReportError("depends-on-self")

It doesn't look that different, but it is different in that instead of accessing a normal dict/list data structure, you're calling an object, which makes REST queries under the hood, and generally does who knows what.

I don't have a good solution. Any ideas?
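To make the discussion a bit more concrete, here is a minimal sketch of what a data access object like the one described above might look like. Everything beyond the check function itself is an assumption made up for illustration: the class name LazyPkgData, the fetch_section callable (a stand-in for a REST query, replaced here by an in-memory dict so the example runs without a network), the FakeErrorMgr helper, the "binaries_dump_info" section name, and the sample package data. It is not how checkpkg works today.

```python
class LazyPkgData(object):
    """Hypothetical data access object: fetches one metadata section per call."""

    def __init__(self, md5_sum, fetch_section):
        # fetch_section(md5_sum, section_name) stands in for a REST query.
        # It is an assumption of this sketch, not an existing checkpkg API.
        self._md5_sum = md5_sum
        self._fetch_section = fetch_section
        self.fetched = []  # names of requested sections, kept for illustration

    def get(self, section_name):
        # Nothing is cached here, so a section can be garbage collected as
        # soon as the check drops its reference to it.
        self.fetched.append(section_name)
        return self._fetch_section(self._md5_sum, section_name)


class FakeErrorMgr(object):
    """Hypothetical stand-in for the error manager; only collects tags."""

    def __init__(self):
        self.errors = []

    def ReportError(self, tag):
        self.errors.append(tag)


def CheckDependsOnSelf(data_access_object, error_mgr, logger, messenger):
    pkgname = data_access_object.get("basic_stats")["pkgname"]
    for depname, dep_desc in data_access_object.get("depends"):
        if depname == pkgname:
            error_mgr.ReportError("depends-on-self")


# In-memory stand-in for the metadata store; all values are invented.
FAKE_STORE = {
    "example-md5": {
        "basic_stats": {"pkgname": "CSWfoo"},
        "depends": [["CSWfoo", "foo - depends on itself"],
                    ["CSWcommon", "common - some other package"]],
        # Placeholder for the large per-binary data (hypothetical key name).
        "binaries_dump_info": ["..."] * 1000,
    },
}


def fetch_from_fake_store(md5_sum, section_name):
    return FAKE_STORE[md5_sum][section_name]


dao = LazyPkgData("example-md5", fetch_from_fake_store)
error_mgr = FakeErrorMgr()
CheckDependsOnSelf(dao, error_mgr, logger=None, messenger=None)
print(error_mgr.errors)
print(dao.fetched)
```

In this sketch the check only ever pulls "basic_stats" and "depends", so the large section is never loaded for it. The cost is exactly the one described above: a check author can no longer inspect one plain data structure, and has to know which sections exist and trust whatever get() does under the hood.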
Maciej
