If you already have a parser for the language, you could use it to
create a TokenStream that you can feed to Lucene. That way you won't be
trying to reinvent a parser using tools designed for natural language.
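For example, something along these lines might work (a rough sketch against the Lucene 4.x TokenStream API; HaskellParser and ParsedToken are hypothetical stand-ins for whatever your parser exposes):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Adapts an existing parser's token stream to Lucene.
// HaskellParser and ParsedToken are stand-ins for your parser's API.
public final class ParserTokenStream extends TokenStream {
  private final HaskellParser parser;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public ParserTokenStream(HaskellParser parser) {
    this.parser = parser;
  }

  @Override
  public boolean incrementToken() throws IOException {
    ParsedToken tok = parser.next(); // assumed to return null when exhausted
    if (tok == null) return false;
    clearAttributes();
    termAtt.setEmpty().append(tok.text());
    offsetAtt.setOffset(tok.startOffset(), tok.endOffset());
    return true;
  }
}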
-Mike
On 6/5/2014 6:42 AM, Johan Tibell wrote:
I will definitely try a prototype. My main question is whether I'm better off creating documents directly or if I should try to parse the compiler output using an analyzer/tokenizer.
On Thu, Jun 5, 2014 at 12:24 PM, Aditya <findbestopensou...@gmail.com>
wrote:
It depends on your requirements. You could index either the source files or the compiler output. Try a proof of concept; that will give you some idea of how to move forward.
Regards
Aditya
www.findbestopensource.com
On Thu, Jun 5, 2014 at 2:48 PM, Johan Tibell <johan.tib...@gmail.com>
wrote:
By "index the entire source file" do you mean "don't index the compiler
output"? If so, that doesn't sound very appealing as it loses most of the
benefit of having a search engine built for searching source code.
On Thu, Jun 5, 2014 at 11:11 AM, Aditya <findbestopensou...@gmail.com> wrote:
Just keep it simple. Index the entire source file; one source file is one document. While indexing, preserve dot (.), hyphen (-) and other special characters. You could use the whitespace analyzer.
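A minimal sketch (Lucene 4.8-era API; the index path, source path and field names are just placeholders):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// One source file -> one document; WhitespaceAnalyzer leaves '.' and '-' intact.
public class IndexSourceFile {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/tmp/index")); // placeholder
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48,
        new WhitespaceAnalyzer(Version.LUCENE_48));
    try (IndexWriter writer = new IndexWriter(dir, cfg)) {
      String path = "src/Data/Map.hs"; // placeholder
      String src = new String(Files.readAllBytes(Paths.get(path)),
          StandardCharsets.UTF_8);
      Document doc = new Document();
      doc.add(new StringField("path", path, Field.Store.YES));
      doc.add(new TextField("contents", src, Field.Store.NO));
      writer.addDocument(doc);
    }
  }
}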
I hope it helps
Regards
Aditya
www.findbestopensource.com
On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tib...@gmail.com>
wrote:
The majority of queries will be look-ups of functions/types by fully qualified name. For example, the query [Data.Map.insert] will find the definition and all uses of the `insert` function defined in the `Data.Map` module. The corpus is all Haskell open source code on hackage.haskell.org.
Being able to support qualified name queries is the main benefit of indexing the output of the compiler (which has resolved unqualified names to qualified names) rather than using simple text-based indexing.
There are three levels of name qualification I want to support in queries:
* Unqualified: myFunction
* Module qualified: MyModule.myFunction
* Package and module qualified: mypackage-MyModule.myFunction
I expect the middle one to be used the most. The last form is sometimes needed for disambiguation, and the first is nice to support as a shorthand when the function name is unlikely to be ambiguous.
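If every occurrence is indexed under all three forms (stacking tokens at the same position, as I mentioned in my first mail below), each level would just be a term query. A sketch; the "name" field is illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NameQueries {
  // "name" is an illustrative field holding every form of each occurrence.
  static Query unqualified()      { return new TermQuery(new Term("name", "myFunction")); }
  static Query moduleQualified()  { return new TermQuery(new Term("name", "MyModule.myFunction")); }
  static Query packageQualified() { return new TermQuery(new Term("name", "mypackage-MyModule.myFunction")); }
}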
For scoring I'd like to have a couple of attributes available. The most important one is whether a term represents a use site or a definition site. This would allow the definition of a function to appear as the first search result.
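One way I could imagine doing that (a sketch only; "def" and "use" are illustrative field names and the boost value is arbitrary) is to index definition sites and use sites in separate fields and boost the definition field at query time:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DefFirstQuery {
  static Query build(String qualifiedName) {
    // The boosted SHOULD clause on "def" pushes definition sites to the top.
    TermQuery def = new TermQuery(new Term("def", qualifiedName));
    def.setBoost(5.0f); // arbitrary; would need tuning against real queries
    BooleanQuery q = new BooleanQuery();
    q.add(def, BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("use", qualifiedName)), BooleanClause.Occur.SHOULD);
    return q;
  }
}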
Is this precise enough? Naturally the scope will grow over time, but this is the core of what I'm trying to do.
-- Johan
On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensou...@gmail.com> wrote:
Hi Johan,
How do you want to search? Work out your search requirements and index according to them. You could look at DuckDuckGo or GitHub code search for comparison.
The easiest approach would be to have a parser that reads each source file and indexes it as a single document. When you search, you will have a single search field that searches the index and retrieves the results. The search field accepts any text from the source file: a function name, class name, comment, variable, etc.
Another approach is to have different search fields for functions, classes, packages, etc. You need to parse the file, identify comments, function names, class names and so on, and index each in a separate field, as in the sketch below.
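A rough sketch of what such a document could look like (field names are illustrative; the extraction step is up to your parser):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class MultiFieldDoc {
  // One field per kind of identifier; extraction is left to the parser.
  static Document build(String functions, String classes, String comments) {
    Document doc = new Document();
    doc.add(new TextField("function", functions, Field.Store.YES));
    doc.add(new TextField("class", classes, Field.Store.YES));
    doc.add(new TextField("comment", comments, Field.Store.NO));
    return doc;
  }
}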
Regards
Aditya
www.findbestopensource.com
On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tib...@gmail.com> wrote:
Hi,
I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking).
I'm wondering how to approach indexing source code. I can see two possible approaches:
* Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file (first sketch below). The file could use a simple line-based format:
myFunction,1:12-1:22,my-package,defined-here,more-metadata
myFunction,5:11-5:21,my-package,used-here,more-metadata
...
The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.
* Use an IndexWriter to create a Document directly (second sketch below), as done here:
http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
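For the first approach, I imagine the tokenizer could look roughly like this (Lucene 4.8-era API; this assumes the line:column spans have already been converted to character offsets, e.g. "myFunction,12,22,...", which my metadata generator would have to do):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// One token per metadata line, e.g. "myFunction,12,22,my-package,defined-here".
public final class MetadataTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private BufferedReader lines;

  public MetadataTokenizer(Reader input) {
    super(input);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    lines = new BufferedReader(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    String line = lines.readLine();
    if (line == null) return false;
    clearAttributes();
    String[] parts = line.split(",");
    termAtt.setEmpty().append(parts[0]); // the identifier itself
    // Assumes character offsets in columns 2 and 3 (not line:column spans).
    offsetAtt.setOffset(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    return true;
  }
}

For the second approach, something like this (field names and values illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// One Document per occurrence, built straight from the compiler output.
public class IndexOccurrence {
  static void add(IndexWriter writer) throws Exception {
    Document doc = new Document();
    doc.add(new StringField("name", "Data.Map.insert", Field.Store.YES));
    doc.add(new StringField("package", "containers", Field.Store.YES));
    doc.add(new StringField("kind", "definition", Field.Store.YES));
    doc.add(new StringField("span", "1:12-1:22", Field.Store.YES));
    writer.addDocument(doc);
  }
}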
I'm new to Lucene so I can't quite tell which approach is more likely to work well. Which way would you recommend?
Other things I'd like to do that might influence the answer:
- Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and the unqualified name (e.g. myFunction) for a term.
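I assume this works like synonyms do, with a position increment of 0. A sketch, with a String[][] standing in for the real compiler output:

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits all forms of each name at the same position: the first form
// advances the position by 1, the rest stack on top with increment 0.
public final class StackedNameStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final String[][] names; // e.g. {{"MyModule.myFunction", "myFunction"}}
  private int word = 0, variant = 0;

  public StackedNameStream(String[][] names) {
    this.names = names;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (word >= names.length) return false;
    clearAttributes();
    termAtt.setEmpty().append(names[word][variant]);
    posIncAtt.setPositionIncrement(variant == 0 ? 1 : 0);
    if (++variant == names[word].length) {
      variant = 0;
      word++;
    }
    return true;
  }
}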
-- Johan