On Tue, Jun 22, 2010 at 12:07 AM, Kamran Riaz Khan <[email protected]> wrote:
> Hello all,
>
> One of the areas I am working for Summer of Code is attachment
> searching. Basically the aim is to let an Arsenal user do something like
> this:
>
> "Search for <text> in attachments of all bugs that exist in source
> packages subscribed by <team>"

This pretty much has to be done using some sort of index. 'Attachments of all bugs that exist in source packages subscribed by <team>' can run to tens of thousands of attachments totaling gigabytes of data. Even if you only look at the first few kilobytes of each attachment, that is still hundreds or maybe thousands of Librarian requests that need to be made.

For some real numbers: for the ubuntu-bugs team, that query matches 32000 attachments averaging 128kB in size, totaling 4GB of data.

I don't think using our existing database full text search will be useful - it is designed for searching for words in text, but your examples need some sort of substring search. An external search engine might be better, such as Google or a Google appliance, but they are still word searches to some extent.

I'm really not sure of the best way to tackle this problem. The Librarian data is not stored in the database because there are multiple TB of files. The team membership information is in the relational database. There are no indexes anywhere to the contents of the Librarian files.

I think we need some sort of external search engine (I don't think we want to integrate this into the Librarian core). Ideally we could feed it subscriber information, allowing it to determine the set of 32000 attachments that ubuntu-bugs has access to, rather than having to calculate this information from the relational db and then feed the ids to the search engine.

Whatever approach is chosen certainly needs signoff from the LP team leads, as the resource requirements are non-trivial and someone needs to pay for the hardware.

An alternative approach would be to keep the search separate from Launchpad. A team would host their own search engine somewhere. A Launchpad API script would be run regularly, pulling new attachments meeting the team's criteria from the Librarian and feeding them into the search engine.

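To make the word search versus substring search distinction concrete, here is a minimal sketch of one common way to index for substring queries: a trigram index. This is only an illustration, not something proposed above or used by Launchpad. The class name `TrigramIndex`, the attachment ids and the sample texts are all hypothetical, and the optional `allowed_ids` argument stands in for the set of attachments a team has access to, however that set ends up being computed.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy in-memory substring index (hypothetical; not a Launchpad API)."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> attachment ids
        self.texts = {}                   # attachment id -> indexed text

    def add(self, attachment_id, text):
        """Index one attachment's text under every trigram it contains."""
        self.texts[attachment_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(attachment_id)

    def search(self, needle, allowed_ids=None):
        """Return sorted ids of attachments containing needle as a substring."""
        grams = trigrams(needle)
        if grams:
            # A match must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams))
        else:
            # Needles shorter than 3 characters cannot use the index.
            candidates = set(self.texts)
        if allowed_ids is not None:
            # Restrict to the attachments the searching team may see.
            candidates &= set(allowed_ids)
        # Trigram hits can be false positives, so verify each candidate.
        return sorted(i for i in candidates if needle in self.texts[i])


if __name__ == "__main__":
    index = TrigramIndex()
    index.add(1, "Xorg.0.log: (EE) intel(0): Failed to load module")
    index.add(2, "dmesg: usb 1-1: new high speed USB device")
    index.add(3, "Segfault in libintel_drv.so")
    print(index.search("intel"))
    print(index.search("intel", allowed_ids={3}))
```

The sketch keeps the full text in memory for the final verification step, which would not scale to 4GB of attachments. A real deployment would presumably keep only the postings and fetch the few candidate files from the Librarian to verify them.
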
-- 
Stuart Bishop <[email protected]>
http://www.stuartbishop.net/

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

