Hi Dominik,

On Mon, Sep 26, 2016 at 6:55 PM, Dominik Ruf <[email protected]> wrote:
> Hi,
>
> there are basically 2 different kinds of searches in kallithea.
>
> 1. filtering revisions
> Mads mentioned 2 years ago that he plans to add some support for this
> https://bitbucket.org/conservancy/kallithea/issues/18/search-needs-to-be-improved
> 2. searching in multiple repositories (incl. fulltext searching in the
> files)
>
> I think the first point is pretty much straightforward. Git and Mercurial
> support filtering revisions. It basically 'only' needs to be implemented.
> :-)
>
> But the second one is more complicated.
> There are multiple problems with the current implementation.
>
> 1. For starters, since 9c5f794df7cd the make-index command is broken. But
> that can be easily fixed.
> 2. What is not so easy to fix is the fact that indexing is currently
> incredibly slow.
> 3. The indexing is done periodically, it only indexes the tip revision at
> indexing time and the search results refer to the tip at search time.
> Therefore
> a) you may get hits that are no longer valid
> b) you may get no hits even though the string is present now
> c) you can't search for things that have been removed
>
> I believe all this is solvable. I looked into the code and found a few
> places where the indexing can definitely be improved.
> But I don't have much experience with whoosh. So I'm not sure if it is even
> worth it to fix the current implementation, or if I should restart with solr
> or elastic search.
>
> My questions to you guys are:
>
> 1. Do you have experience with whoosh? Does it scale to gigabytes of data?
> 2. Would you even pull an implementation that requires installing solr? Note:
> I believe installation and setup of solr can be automated.
> 3. Or maybe you think the fulltext search should be dropped altogether.
>
I personally think that fulltext search on repositories that typically
contain source code has relatively little value. Fulltext search engines
such as whoosh or solr are not aware of the structure of source code, and
thus have no advanced capabilities such as searching only in identifiers
or clicking through on symbols in the search results. Real code browsers,
like OpenGrok or LXR, do have such features.

The few times that I actually use fulltext search, e.g. on GitHub, are
when I'm too lazy to clone the repo and use a grep-like tool to find
something myself. It definitely has some value, but not that much.

With this in mind, I actually think there is much more value in fixing
the first type of search you highlight, i.e. filtering revisions.
Therefore, in my opinion we should prioritize 'just implementing' that
before looking at fulltext search.

Coming back to fulltext search:

- I have no specific experience with whoosh.
- Regardless of the tool we'd use (whoosh, solr, ...), I think it should
  always be optional. Kallithea should be installable without search
  capabilities.
- It may be more useful to implement a flexible mechanism where Kallithea
  offers searching but the backend is customizable. I.e. the search term
  can be passed to whoosh, solr, or any other tool that the user wants to
  configure. The tool would get the search term and probably some other
  elements referring to the repo to search or to specific paths in the
  repo. The Kallithea documentation can give some examples of how to plug
  known tools into this, but need not be concerned with the entire gamut
  of tools available, nor choose one specific tool that may not scale to a
  particular use case. The same mechanism could even be used to hook code
  browsers like OpenGrok/LXR into the search feature, rather than pure
  text search.

Best regards,
Thomas
_______________________________________________
kallithea-general mailing list
[email protected]
http://lists.sfconservancy.org/mailman/listinfo/kallithea-general
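
To make the customizable-backend idea a bit more concrete, here is a rough
sketch in Python. Everything in it is made up for illustration and is not
existing Kallithea code: the names SearchBackend, SearchHit, NullBackend,
PlainTextBackend, BACKENDS and get_backend are all hypothetical, as are the
backend names "none" and "plaintext". It assumes the backend receives the
search term, the repo location and optionally some paths inside the repo.
The plain-text backend is only a naive stand-in for a real tool such as
whoosh or solr, and it scans a checked-out working directory rather than
the repository history.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class SearchHit:
    """One matching line (hypothetical result type)."""
    repo: str
    path: str
    line_no: int
    line: str


class SearchBackend(Protocol):
    """Hypothetical interface that every pluggable backend would implement."""

    def search(self, term: str, repo_root: Path,
               paths: Optional[List[str]] = None) -> Iterable[SearchHit]:
        ...


class NullBackend:
    """Search disabled: installable without any search tool, never returns hits."""

    def search(self, term, repo_root, paths=None):
        return []


class PlainTextBackend:
    """Naive substring scan, standing in for whoosh, solr, OpenGrok, etc."""

    def search(self, term, repo_root, paths=None):
        # Restrict the scan to the given paths, or scan the whole repo root.
        roots = [repo_root / p for p in paths] if paths else [repo_root]
        hits = []
        for root in roots:
            if root.is_file():
                files = [root]
            else:
                files = sorted(p for p in root.rglob("*") if p.is_file())
            for f in files:
                try:
                    text = f.read_text(encoding="utf-8")
                except (UnicodeDecodeError, OSError):
                    continue  # skip binary or unreadable files
                for n, line in enumerate(text.splitlines(), start=1):
                    if term in line:
                        rel = f.relative_to(repo_root).as_posix()
                        hits.append(SearchHit(repo_root.name, rel, n, line))
        return hits


# Registry keyed by the name an admin would put in the configuration.
BACKENDS: Dict[str, Callable[[], SearchBackend]] = {
    "none": NullBackend,
    "plaintext": PlainTextBackend,
}


def get_backend(name: str) -> SearchBackend:
    """Look up and instantiate the configured backend."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown search backend: {name!r}") from None
```

With something like this, the search page would only ever call
get_backend(configured_name).search(term, repo_root, paths), and a site
that wants solr or OpenGrok would register its own class under its own
name instead of Kallithea having to pick one tool for everybody.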
