Marvin Humphrey wrote on 4/14/16, 9:06 PM:
> On Thu, Apr 14, 2016 at 9:53 AM, Kurt Starsinic <[email protected]> wrote:
>> I want to use Lucy to index a bunch of source code (mostly Java, XML, Perl,
>> and C), and I haven't found any clear guidance in the docs.
>
> The easy but not very powerful way is just to index source code as a bag of
> words, using a RegexTokenizer which matches `\w+`. But that doesn't meet your
> needs...
>
>> I'd much prefer if the index were reasonably syntax-aware (at the very least,
>> it should distinguish a comment from not-a-comment, but I'd love to
>> distinguish use from mention).
>
> So for that you're looking at some sort of lex/parse compiler front end for
> each language, which you then use to feed into different fields. You could
> potentially get quite fine grained.

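
To make the "feed into different fields" idea concrete, here is a minimal sketch in Python (standard library only, not Lucy's API). It assumes C-style comment syntax, so it would only suit C and Java; the function name `split_fields` and the field names `code` and `comment` are hypothetical. A real front end would need a proper lexer per language.

```python
import re

# Matches, in order: string/char literals (so comment markers inside them are
# ignored), /* block */ comments, and // line comments.
_TOKEN = re.compile(
    r'''
      (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
    | (?P<block>/\*.*?\*/)
    | (?P<line>//[^\n]*)
    ''',
    re.DOTALL | re.VERBOSE,
)


def split_fields(source):
    """Return {'code': ..., 'comment': ...} for C-like source text."""
    code, comments = [], []
    pos = 0
    for m in _TOKEN.finditer(source):
        code.append(source[pos:m.start()])
        if m.lastgroup == "string":
            code.append(m.group())      # literals stay in the code field
        else:
            comments.append(m.group())  # block or line comment
            code.append(" ")            # keep neighbouring tokens separated
        pos = m.end()
    code.append(source[pos:])
    return {"code": "".join(code), "comment": "\n".join(comments)}
```

Each value in the returned dictionary could then be indexed as its own field, so that a search can be restricted to comments or to code.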
If I were tackling this project, I would write a SWISH::Filter and use the Dezi
system.
https://metacpan.org/pod/SWISH::Filter#WRITING-FILTERS
Basically, you would use a language-specific parser to convert everything to
XML, which the Dezi system can parse natively.
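
As a rough illustration of the "convert everything to XML" step, the Python sketch below serializes already-separated fields into one small XML document per source file. The element names (`doc`, `path`, `language`, and whatever field names are passed in) are assumptions made for this example, not a schema that Dezi or SWISH::Filter prescribes, and `to_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def to_xml(path, language, fields):
    """Serialize one parsed source file as a small XML document.

    `fields` maps a field name (used as the element name) to its text.
    """
    doc = ET.Element("doc")
    ET.SubElement(doc, "path").text = path
    ET.SubElement(doc, "language").text = language
    for name, text in fields.items():
        ET.SubElement(doc, name).text = text
    # ElementTree escapes <, > and & in text, which matters for source code.
    return ET.tostring(doc, encoding="unicode")
```

For example, `to_xml("Foo.java", "java", {"code": "if (a < b) {}", "comment": "// compare"})` yields a document in which each field is a separate element that an XML-aware indexer could map to its own field.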
It really all depends on the level of granularity you want for fields, and what
kind of tokenization you want -- e.g. is "foo()" a single term? or is it "foo"?
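
The tokenization question can be seen with two regular expressions. This is a Python sketch of the pattern choice only; the names `WORD`, `CALL`, and `tokenize` are hypothetical, and the same patterns would need to be checked against whatever tokenizer the indexer actually uses.

```python
import re

# Bag-of-words pattern: punctuation is dropped, so "foo()" yields "foo".
WORD = re.compile(r"\w+")

# Variant that keeps an empty call suffix attached, so "foo()" stays one term.
CALL = re.compile(r"\w+(?:\(\))?")


def tokenize(text, pattern):
    """Return every non-overlapping match of `pattern` in `text`."""
    return pattern.findall(text)


print(tokenize("foo() calls bar(x)", WORD))  # ['foo', 'calls', 'bar', 'x']
print(tokenize("foo() calls bar(x)", CALL))  # ['foo()', 'calls', 'bar', 'x']
```

With the first pattern a search for "foo" matches both the call and any other mention; with the second, "foo()" and "foo" are distinct terms, which is one crude way to separate a call from a bare mention.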
--
Peter Karman . http://peknet.com/ . [email protected]