On Thu, Apr 14, 2016 at 9:53 AM, Kurt Starsinic <[email protected]> wrote:

> I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
> and C), and I haven't found any clear guidance in the docs.

The easy but not very powerful way is just to index source code as a bag of
words, using a RegexTokenizer which matches `\w+`.  But that doesn't meet your
needs...
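
To make the bag-of-words baseline concrete, here is a minimal sketch in plain Python. It is not Lucy code: it only applies the same `\w+` pattern to a line of source. The lowercasing step and the name `bag_of_words` are assumptions added for illustration.

```python
import re

# The pattern from the message: runs of word characters.
WORD = re.compile(r"\w+")

def bag_of_words(source: str) -> list[str]:
    """Return lowercased word tokens, ignoring all syntax (assumed normalization)."""
    return [m.group(0).lower() for m in WORD.finditer(source)]

sample = "int count = 0; // reset count"
tokens = bag_of_words(sample)
print(tokens)
```

Note that `count` from the statement and `count` from the comment come out as identical tokens, which is exactly the limitation described above.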

> I'd much prefer
> if the index were reasonably syntax-aware (at the very least, it should
> distinguish a comment from not-a-comment, but I'd love to distinguish use
> from mention).

So for that you're looking at some sort of lex/parse compiler front end for
each language, whose output you then feed into different fields. You could
potentially get quite fine-grained, with fields such as:

* package names
* class names
* imports
* comments
* base/extends/implements
* function bodies
* return types
* file name
* url
* content [i.e. all content together]
* ...

Each field would be ordinary flat text.  (You might want to insert some fake
separator token in between function bodies to prevent spurious phrase
matching.)  Exactly how you get flat text out of a compiler front end is going
to be specific to the module.
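
As a rough illustration of the field-splitting idea, here is a sketch of a tiny hand-written scanner that separates C/Java-style comments from everything else. It also joins function bodies with a fake separator token. It is a stand-in for a real compiler front end, not a substitute for one. The names `split_comments`, `join_bodies`, and `__FUNCTION_BREAK__` are hypothetical, and the scanner ignores preprocessor tricks, trigraphs, and the like.

```python
# Hypothetical separator; it matches \w+ so it occupies a token position
# and keeps phrases from matching across function boundaries.
SEPARATOR = " __FUNCTION_BREAK__ "

def split_comments(source: str) -> dict[str, str]:
    """Split C/Java-style source into flat 'comments' and 'code' text."""
    comments, code = [], []
    i, n = 0, len(source)
    while i < n:
        two = source[i:i + 2]
        ch = source[i]
        if two == "//":
            # Line comment: runs to end of line (or end of input).
            end = source.find("\n", i)
            end = n if end == -1 else end
            comments.append(source[i + 2:end].strip())
            i = end
        elif two == "/*":
            # Block comment: runs to the closing */ (or end of input).
            end = source.find("*/", i + 2)
            end = n if end == -1 else end
            comments.append(source[i + 2:end].strip())
            i = min(end + 2, n)
        elif ch in "\"'":
            # Copy string/char literals whole so "//" inside them is not
            # mistaken for a comment; a backslash escapes the next char.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            code.append(source[i:j])
            i = j
        else:
            code.append(ch)
            i += 1
    return {"comments": "\n".join(comments), "code": "".join(code)}

def join_bodies(bodies: list[str]) -> str:
    """Concatenate function bodies with the fake separator token between them."""
    return SEPARATOR.join(bodies)

src = 'int x = 1; // counter\nchar *u = "http://a"; /* url */\n'
fields = split_comments(src)
print(fields["comments"])
print(join_bodies(["return a;", "b = 1;"]))
```

Each value in `fields` is ordinary flat text, ready to be handed to whichever indexing field it belongs to.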

For parsing Perl source code, you presumably want PPI.  For XML, choose your
favorite XML module.  For Java/C, I don't know -- perhaps someone else has a
suggestion.

The next phase is designing a decent query interface.  Searching all fields
with default weighting is unlikely to yield optimum results, so you'll have to
tune it like you would any other search app.  Your users are probably
sophisticated and will also appreciate an "advanced" interface.
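
To show what tuning per-field weighting means, here is a toy sketch in plain Python. It is not how Lucy scores documents. It just counts term hits per field and multiplies by a boost. The field names, the boost values in `FIELD_BOOSTS`, and the `score` function are all assumptions made up for the example.

```python
import re

# Hypothetical boosts: a hit in a class name counts more than one in a comment.
FIELD_BOOSTS = {"class_names": 3.0, "comments": 1.5, "content": 1.0}

def score(doc: dict[str, str], query: str) -> float:
    """Sum boosted term counts across fields (toy scoring, not Lucy's)."""
    terms = re.findall(r"\w+", query.lower())
    total = 0.0
    for field, boost in FIELD_BOOSTS.items():
        tokens = re.findall(r"\w+", doc.get(field, "").lower())
        total += boost * sum(tokens.count(t) for t in terms)
    return total

definition = {"class_names": "Parser", "content": "class Parser"}
mention = {"comments": "parser helper", "content": "parser helper"}
print(score(definition, "parser"), score(mention, "parser"))
```

With these made-up boosts, the file that defines `Parser` outranks the one that only mentions it in a comment, which is the kind of behavior you would adjust the weights to get.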

Finally, you'll want excerpting.  That's what you need the `content` field
for.  Hopefully Lucy's Highlighter will choose good excerpts out of the box.

Add a link from the URL field, and there you go!

> I'll be happy to formally document this, once I get it working.

It would be cool to get some sort of markdown document for the Lucy Cookbook,
similar to these!

https://github.com/apache/lucy/tree/apache-lucy-0.5.0/core/Lucy/Docs/Cookbook

Marvin Humphrey
