Hi,

It's that time again. Time for a "State of the Pooch" email to let the community know how we're doing with Beagle and where we're going. Previous addresses are here:

http://mail.gnome.org/archives/dashboard-hackers/2006-November/msg00064.html
http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html

A lot of the stuff in the previous SotP, roughly a year ago, still applies in some way today.

* dBera is now co-maintainer

I'm happy to announce that Debajyoti Bera, who has easily written more code for Beagle in the last year than anyone else, has become a co-maintainer of the project. This is great news because he has solid knowledge of the codebase and is the first non-Novell maintainer of the code. dBera will still be mostly coding, and he will share final say with me over patches, technical direction, etc. He may also do releases from time to time. :)

* The never-ending quest for 0.3.0

Work continues on making a great 0.3.0 release, and in the meantime we're pushing out 0.2.x maintenance releases. I'd love it if people could regularly run from SVN trunk so that we can stress test a lot of the features that I'll mention below and get a 0.3.0 release out that less adventurous users can enjoy.

* Networked searches

Thanks to work from Lukas Lipka and Fredrik Hedberg, we've (finally!) merged the network search code from Alexis and Kyle's Summer of Code projects from last year into the codebase. The Beagle daemon now provides a backend which can query other Beagle instances. There is some preliminary support for Avahi and autodiscovery of other Beagle daemons on the network, but that's currently disabled while some stability bugs are worked out. There's still a lot of work to be done here in terms of how we access non-file resources on remote machines, security concerns, etc., so this code should be considered experimental for now. You can turn it on by toggling the networked setting in one of the configuration tools.

* Web user interface

dBera and Nirbheek Chauhan have been working on a Web interface to Beagle.
In addition to search results, the UI exposes index information and daemon status, and it can shut down the daemon. The Web UI relies on the network infrastructure. It's not meant to be a replacement for beagle-search, but it is nice in that it is easily skinnable, will make it easy to view email in the browser, etc.

http://beagle-project.org/Beagle_Webinterface

* New configuration system

dBera has been working on a new configuration system to handle two shortcomings in the current system: (1) allowing a system-wide configuration file, so sysadmins can apply policy to all users, and (2) allowing plugins (like filters and backends) to store and retrieve their own configuration options. The configuration manager loads the global config file (in /etc/beagle/config/) and the local one, merging the two. This also fixes the current problem where all settings are saved in the user's config file, not just the ones that have been changed from the default.

* Xesam support

Arun Raghavan has written an adapter to Beagle which implements the Xesam freedesktop.org search spec, and the reference tools run against it. Exactly how this will be integrated into the code is unclear at this point, however. As of right now, there are no fully fledged search tools which use the Xesam API, so we're not ready to commit to the APIs natively. Also, integrating D-Bus back into Beagle is a worthy goal, but will require quite a bit of work.

* Firefox extension

More great Summer of Code work: the new Firefox extension has been merged into the source tree. In addition to indexing web pages as you view them, you can now index web pages, links, and images on demand. The settings UI is greatly improved as well.

http://dtecht.blogspot.com/2007/08/hey-firefox-beagle-this-now.html

* Thunderbird extension

This is another SoC project. We decided to take a different approach from the previous Thunderbird work and the Evolution backend, and add support for Thunderbird through an extension.
This extension is responsible for sending emails to the running Beagle daemon for indexing. While you have to be running Thunderbird for this to work, it's fast and much, much friendlier on system resources.

* Experimental RDF branch

This is an experimental branch which will export an RDF service that clients can query. This is something that has been planned from the beginning in Beagle, but we've never gotten around to it until now. As data is indexed, an RDF store will be created alongside the text index, and more complex relationships between the data can be examined.

* Lots of work to be done

My list of things I would like to see get some attention:

- Rewrite of the file system backend. I've mentioned this on the list before, but I wanted to give a little more info. When we designed the file system backend, we decided to largely separate files and folders from their file system hierarchy. This allowed us to handle moves of any number of files underneath a folder instantaneously. However, in doing so we had to trade off the ability to search for files underneath a given folder. In retrospect, I think this was the wrong decision. In addition to adding a ton of complexity to the code, it has a major negative effect on memory usage and prohibits users from doing an extremely common type of search. I feel that the file system backend has to be rewritten much more simply, with the parent-child relationship of files indexed and easily searchable. This will make large moves inefficient, but will make a more common use case possible. (Moving large numbers of files is what I call a "thundering herd" problem, and one that has to be dealt with anyway, because things like "rm -rf" already trigger it.)

- D-Bus back in Beagle as the primary message system. I wrote the current serialized XML format a couple of years ago now and while it's served us well, I think that junking that code and switching back to D-Bus is the right thing to do.
D-Bus has matured and stabilized considerably, and we now have a totally native C# implementation of the protocol. In the end, I think it will be quite a bit faster than the automatic XML parsing that happens today.

- Removable media. It came up again fairly recently on the list, but I'd like to see some sort of integration of Beagle with HAL so that many removable devices can be indexed automatically, and so that it's possible to retrieve information about files on offline storage like CDs.

- Test suites. We had a Novell-internal test suite for many file formats for a while, but I couldn't distribute the majority of those files. We're gradually building up a good set of files in SVN to test, but we really need people to start writing test harnesses for those files and regression tests for individual subsystems. This work will help stability and development tremendously.

* Miscellaneous other niceties

- Reworking of child indexables (i.e., PDF inside a ZIP inside an email): These are faster and use less memory than before.

- Taglib-sharp: We now use this actively developed and maintained library for extracting metadata from audio files.

- Snowball analyzers: The first step toward language-based indexing.

- Sqlite3 and Mono.Data.Sqlite: In 0.3.0 we will support only sqlite version 3, and use the upstream, maintained Mono APIs for this, which should greatly reduce bugs.

- Nautilus metadata: Emblems, notes, and other metadata that are set through GNOME's Nautilus file manager are now indexed. This was a proof of concept implementation for how to extract metadata from external sources; there is also an API for this that F-Spot uses.

- TeX filter: One of the most oft-requested features.

- TextCache: We were wasting TONS of disk space with the way things were laid out before. Thanks to dBera and Arun, we now have a hybrid file system and database system for much more efficient storage of text data from complex files.
http://dtecht.blogspot.com/2007/10/i-saved-80mb.html

- Snippets: The gross way of getting HTML snippets back is fixed. You can now request the size of the snippet you want and get structured data back so that it's easier to present and doesn't require an HTML widget or a regexp to transform the output. You also now get the line of the file that the snippet is on, the sentence before the match, and the sentence after the match.

- New query API to retrieve metadata about a particular URI, including the complete cached text rather than a snippet.

I think that's most of the big stuff! As I always do, I am sure I forgot something. But hopefully it won't be another 11 months before the next one of these emails.

Your attention and help are appreciated!

Thanks,
Joe

_______________________________________________
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers