On Thu, Apr 14, 2016 at 9:53 AM, Kurt Starsinic <[email protected]> wrote:
> I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
> and C), and I haven't found any clear guidance in the docs.

The easy but not very powerful way is just to index source code as a bag of
words, using a RegexTokenizer which matches `\w+`.  But that doesn't meet your
needs...

> I'd much prefer
> if the index were reasonably syntax-aware (at the very least, it should
> distinguish a comment from not-a-comment, but I'd love to distinguish use
> from mention).

So for that you're looking at some sort of lex/parse compiler front end for
each language, which you then use to feed into different fields.  You could
potentially get quite fine grained.

  * package names
  * class names
  * imports
  * comments
  * base/extends/implements
  * function bodies
  * return types
  * file name
  * url
  * content [i.e. all content together]
  * ...

Each field would be ordinary flat text.  (You might want to insert some fake
separator token in between function bodies to prevent spurious phrase
matching.)

Exactly how you get flat text out of a compiler front end is going to be
specific to the module.  For parsing Perl source code, you presumably want
PPI.  For XML, choose your favorite XML module.  For Java/C, I don't know --
perhaps someone else has a suggestion.

The next phase is designing a decent query interface.  Searching all fields
with default weighting is unlikely to yield optimum results, so you'll have to
tune it like you would any other search app.  Your users are probably
sophisticated and will also appreciate an "advanced" interface.

Finally, you'll want excerpting.  That's what you need the `content` field
for.  Hopefully Lucy's Highlighter will choose good excerpts out of the box.
Add a link from the URL field, and there you go!

> I'll be happy to formally document this, once I get it working.

It would be cool to get some sort of markdown document for the Lucy Cookbook,
similar to these!

    https://github.com/apache/lucy/tree/apache-lucy-0.5.0/core/Lucy/Docs/Cookbook

Marvin Humphrey
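
P.S. To make the "feed into different fields" idea concrete, here is a rough
sketch in Python of splitting C/Java-style source into flat-text `comments`,
`code`, and `content` fields.  Everything in it is an assumption for
illustration: the function name `split_fields`, the field names, and the fake
separator token `zzsepzz` are hypothetical, and none of it is Lucy API.  It
uses a regex instead of a real compiler front end, so it only understands
`//` and `/* */` comments and simple string literals.  The separator trick
described above for function bodies is applied here between comments.

```python
import re

# Alternatives are tried in order at each position: string/char literals
# first (so "//" inside a string is not mistaken for a comment), then
# block comments, then line comments.
_TOKEN = re.compile(
    r'''
    (?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')
    |(?P<block>/\*.*?\*/)
    |(?P<line>//[^\n]*)
    ''',
    re.DOTALL | re.VERBOSE,
)

# Hypothetical fake token. It is word-like on purpose, so that a tokenizer
# matching \w+ keeps it and it occupies a position between neighbours.
SEPARATOR = " zzsepzz "


def split_fields(source):
    """Return a dict of flat-text fields for one C/Java-style source file."""
    comments = []
    code = []
    pos = 0
    for match in _TOKEN.finditer(source):
        code.append(source[pos:match.start()])
        if match.lastgroup == "string":
            # String literals stay in the code field untouched.
            code.append(match.group())
        else:
            text = match.group()
            # Drop the comment delimiters: /* ... */ or // ...
            text = text[2:-2] if match.lastgroup == "block" else text[2:]
            comments.append(" ".join(text.split()))
            code.append(" ")
        pos = match.end()
    code.append(source[pos:])
    return {
        "comments": SEPARATOR.join(comments),
        "code": " ".join("".join(code).split()),
        "content": source,  # everything together, for excerpting
    }


if __name__ == "__main__":
    sample = 'int x = 1; // counter\nchar *s = "http://example.com";\n'
    for name, value in split_fields(sample).items():
        print(name, "=>", repr(value))
```

Each value in the returned dict would then be handed to the indexer as one
field of the document.  A real lexer per language (PPI for Perl, for
instance) would replace the regex once you need more than comment versus
not-a-comment.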
