Hi, I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking).
I'm wondering how to approach indexing source code. I can see two possible approaches:

 * Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file. The file could use a simple line-based format:

       myFunction,1:12-1:22,my-package,defined-here,more-metadata
       myFunction,5:11-5:21,my-package,used-here,more-metadata
       ...

   The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.

 * Use an IndexWriter to create a Document directly, as done here:
   http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene, so I can't quite tell which approach is more likely to work well. Which way would you recommend?

One other thing I'd like to do that might influence the answer:

 - Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and the unqualified name (e.g. myFunction) for a term.

-- Johan
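P.S. To make the question a bit more concrete, here's very roughly what I had in mind for the second approach, with one Document per occurrence. This is untested, the field names are made up, and I'm assuming Lucene 3.x-style APIs:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexOneOccurrence {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("index"));
            IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, conf);

            // One Document per occurrence, filled straight from the GHC metadata.
            Document doc = new Document();
            doc.add(new Field("name", "myFunction",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("qualifiedName", "MyModule.myFunction",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("package", "my-package",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("kind", "defined-here",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            // The span is only needed for display, so store it without indexing it.
            doc.add(new Field("span", "1:12-1:22",
                    Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);

            writer.close();
            dir.close();
        }
    }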
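For the same-position idea, my understanding is that a custom TokenStream can stack two terms by giving the second one a position increment of 0, something like the following (again just an untested sketch):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    /** Emits the qualified and unqualified names as two terms at the same position. */
    public final class QualifiedNameStream extends TokenStream {
        private final String qualified;
        private final String unqualified;
        private int emitted = 0;

        private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);

        public QualifiedNameStream(String qualified, String unqualified) {
            this.qualified = qualified;
            this.unqualified = unqualified;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (emitted == 2) {
                return false;
            }
            clearAttributes();
            if (emitted == 0) {
                termAtt.append(qualified);
                posIncrAtt.setPositionIncrement(1);  // first term takes a new position
            } else {
                termAtt.append(unqualified);
                posIncrAtt.setPositionIncrement(0);  // 0 = same position as the previous term
            }
            emitted++;
            return true;
        }
    }

which I'd then add with doc.add(new Field("name", new QualifiedNameStream("MyModule.myFunction", "myFunction"))). Is that the right direction, or is there a better way?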