Hi, I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking).
I'm wondering how to approach indexing source code. I can see two possible approaches:

 * Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file. The file could use a simple line-based format:

       myFunction,1:12-1:22,my-package,defined-here,more-metadata
       myFunction,5:11-5:21,my-package,used-here,more-metadata
       ...

   The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.

 * Use an IndexWriter to create a Document directly, as done here:
   http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene, so I can't quite tell which approach is more likely to work well. Which way would you recommend?

One other thing I'd like to do that might influence the answer:

 - Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and the unqualified name (e.g. myFunction) for a term.

-- Johan
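P.S. To make the question a bit more concrete, here's very roughly what I had in mind for the second approach, with one Document per occurrence. This is untested, the field names are made up, and I'm assuming Lucene 3.x-style APIs:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexOneOccurrence {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("index"));
            IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, conf);

            // One Document per occurrence, filled straight from the GHC metadata.
            Document doc = new Document();
            doc.add(new Field("name", "myFunction",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("qualifiedName", "MyModule.myFunction",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("package", "my-package",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("kind", "defined-here",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            // The span is only needed for display, so store it without indexing it.
            doc.add(new Field("span", "1:12-1:22",
                    Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);

            writer.close();
            dir.close();
        }
    }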
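For the same-position idea, my understanding is that a custom TokenStream can stack two terms by giving the second one a position increment of 0, something like the following (again just an untested sketch):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    /** Emits the qualified and unqualified names as two terms at the same position. */
    public final class QualifiedNameStream extends TokenStream {
        private final String qualified;
        private final String unqualified;
        private int emitted = 0;

        private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);

        public QualifiedNameStream(String qualified, String unqualified) {
            this.qualified = qualified;
            this.unqualified = unqualified;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (emitted == 2) {
                return false;
            }
            clearAttributes();
            if (emitted == 0) {
                termAtt.append(qualified);
                posIncrAtt.setPositionIncrement(1);  // first term takes a new position
            } else {
                termAtt.append(unqualified);
                posIncrAtt.setPositionIncrement(0);  // 0 = same position as the previous term
            }
            emitted++;
            return true;
        }
    }

which I'd then add with doc.add(new Field("name", new QualifiedNameStream("MyModule.myFunction", "myFunction"))). Is that the right direction, or is there a better way?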