Hi, I am the maintainer of the GNU Image Finding Tool (GIFT) and active in the Fer-de-Lance project, which has been in (not very loud, but behind-the-scenes active) existence since April last year. Within this project we work towards integrating search services into the desktop.
I am mailing our list and a couple of other developer lists because I think I have found an architecture that provides security while maintaining most of the advantages of daemon-based search engine architectures. I think this architecture and the associated tricks are flexible enough to encompass different search engines, so this mail is not about Medusa vs. htdig vs. GIFT, but rather about how to work together to solve our common security problems for desktop integration of our engines. And, of course, I would be happy to get some suggestions for improvement and/or some developer time. I would be less happy if someone found a fundamental flaw, but even that would be better than wasting my time trying to develop this stuff further. Now let's go into more detail.

GOAL: The goal is to provide search services to the desktop user. These search services should encompass not only web-visible URLs, but all files accessible to the user, as well as items accessible via http/ftp/etc.

ISSUES: The first issue is -privacy-: the system should not reveal the locations of files that we could not read otherwise. For example: when looking for some correspondence with the health insurance, we do not want to learn that our colleague wrote three letters last month that match our search.

The second is -memory consumption-: all indexes for similarity searching use memory that is either proportional to the size of each indexed file, or quite big to begin with. We do not want plenty of users each rolling their own index; we want one shared index, otherwise we are likely to spend a multiple of our useful disk space on indexes.

SUGGESTION: Use a daemon and make sure that authentication is good. :-) Too easy? Of course the problem lies in providing the authentication. What I suggest is to run a daemon which creates, for each user U, a Unix domain socket which is readable *and* writable *only* by this one user U (and root, of course). All instructions to the indexing daemon, e.g.
  - add item to index
  - delete item from index
  - move item within index (new URL for the same file)
  - block item/subdirectory/pattern (e.g. don't index *.o files)
  - process query

would go through the socket. By knowing which socket received the request, we automatically know the user, and then we just have to check, for each result item, whether it can be read by the user who issued the query. Of course we give back only the readable items.

We can create the sockets as user "nemo" and then chown them using a very small script running under root. So we would be root only for a couple of seconds on startup; afterwards everything would happen as a user (nemo) who has write rights on one directory tree which is unreadable for everyone else. So there is no issue of a big indexing program running under root for days and days in a row.

Adding an item is a (small) issue. We probably have to pipe the uuencoded (or equivalently encoded) binary through the socket in order to have it indexed on the other side. However, I guess the efficiency overhead is small compared to the indexing cost.

Things become a trifle more complex for items which are found on the web. Somebody indexing a web page should probably indicate who else (group, all) is allowed to know that somebody has indexed that page. If several users publish a URL, the least restrictive rights are taken into account.

WHAT'S THERE? WHAT'S NEEDED? Basically, I have tried out the socket stuff with a small test program. Works. Now I am starting to integrate that with the GIFT (which involves cleaning up some of my internet socket code). What is still needed is the filter that stores which URLs are indexed under which owner, and with which rights. On each query the GIFT can ask this filter whether a list of URLs can be given out as a query result. Currently, I would like to base this filter on MySQL. When that filter is in place, writing a Medusa plugin for the GIFT would be easy.
I just finished a primitive htdig GIFT plugin, which will soon go to CVS, so that one just needs some fleshing out.

CONCLUSION: I hope to have convinced you that we can get a secure yet memory-efficient indexing solution for the desktop relatively easily. If this has already been done, please tell me where. If my mail is a stupid suggestion, please tell me that, too. However, if you would like to participate in the coding and design effort, or simply to share your opinion, please do not hesitate to subscribe to the fer-de-lance-development list.

Cheers, Wolfgang

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
