http://pypi.h1.botanicus.net/ is the same demo running behind Apache with mod_gzip on a Core i7 920.
On 22 April 2013 02:11, David Wilson <[email protected]> wrote: > Hi there, > > In a fit of madness caused by another 30 seconds-long PyPI search I > decided to investigate the code, in the hopes of perhaps finding > something simple that would alleviate the extremely long search time. > > I discovered what appears to be a function that makes 6 SQL queries > for each term provided by the search user, in turn those queries > expand to what appear to be %SUBSTRING% table scans across the > releases table, which appears to contain upwards of half a gigabyte of > strings. > > Now the root cause has been located, what to do about it? I looked at > hacking on the code, but it seems webui.py is already too massive for > its own good, and in any case PostgreSQL's options for efficient > search are quite limited. It might be all-round good if the size of > that module started to drop.. > > I wrote a crawler to pull a reasonable facsimile of the releases table > on to my machine via the XML-RPC API, then arranged for Xapian to > index only the newest releases for each package. The resulting > full-text index weighs in at a very reasonable 334mb, and searches > complete almost immediately, even on my lowly Intel Atom colocated > server. > > I wrote a quick hack Flask app around it, which you can see here: > > http://5.39.91.176:5000/ > > The indexer takes as input a database produced by the crawler, which > is smart enough to know how to use PyPI's exposed 'changelog' serial > numbers. Basically it is quite trivial and efficient to run this setup > in an incremental indexing mode. > > As you can see from the results, even my lowly colo is trouncing what > is currently on PyPI, and so my thoughts tend toward making an > arrangement like this more permanent. > > The crawler code weighs in at 150 lines, the indexer a meagre 113 > lines, and the Flask example app is 74 lines. Implementing an exact > replica of PyPI's existing scoring function is already partially done > at indexing time, and the rest is quite easy to complete (mostly > cutpasting code). > > Updating the Flask example to provide an XML-RPC API (or similar), > then *initially augmenting* the old search facility seems like a good > start, with a view to removing the old feature entirely. Integrating > indexing directly would be pointless, the PyPI code really doesn't > need anything more added to it until it gets at least reorganized a > little. > > So for the cost of 334mb of disk, a cron job, and a lowly VPS with > even just 700MB RAM, PyPI's search pains might be solved permanently. > Naturally I'm writing this mail because it bothers me enough to > volunteer help. :) > > Prototype code is here: https://bitbucket.org/dmw/pypi-search (note: > relies on a pre-alpha quality DB library I'm hacking on) > > Thoughts? _______________________________________________ Distutils-SIG maillist - [email protected] http://mail.python.org/mailman/listinfo/distutils-sig
