Grant McLean wrote on 7/10/11 10:28 PM:
> Hi all
>
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.
> Congratulations on getting the project to this level of quality.
>
> My main interest is indexing HTML documents for web sites. It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer? Or is there a
> simple option I can enable somewhere to have this happen automatically?
>

Consider using Swish3 with the Lucy backend.

http://search.cpan.org/dist/SWISH-Prog-Lucy/

If you install SWISH::Prog::Lucy you'll get the swish3 cli with which
you can easily index .html, .xml, .pdf, .doc, .xls, .txt, etc.

Example:

index docs:

 % swish3 -F lucy -i path/to/html/files

search docs:

 % swish3 -q 'some query'

Since the index created is a standard Lucy index, you can search it with
the relevant Lucy classes, or use the SWISH::Prog::Lucy::Searcher wrapper
(which automatically refreshes the index handle when the index is updated).

See also the new Dezi REST server if you want to put a web service in
front of your Lucy index, like Solr:

http://search.cpan.org/dist/Dezi

Docs are still a bit sparse; get in touch if you're interested in helping
flesh them out.

-- 
Peter Karman . http://peknet.com/ . [email protected]
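
For the do-it-yourself route raised in the question (stripping the tags
before passing the text to the indexer), here is a minimal sketch. It is
not part of Lucy or Swish3: it is plain Python using only the standard
library's html.parser, and the names TextExtractor and strip_tags are
hypothetical. Dropping the contents of <script> and <style> is an
assumption about what you would want kept out of the index.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}  # assumption: never index these

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tags and attributes are discarded; only track skipped elements.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace
        # left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags('<p class="x">Hello <b>world</b> &amp; all</p>'))
```

The returned string is what you would hand to the indexer as the field
value in place of the raw file contents. Joining chunks with a space
keeps words in adjacent elements from running together, at the cost of
splitting a word that has inline markup in the middle of it.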
