Hi All,
Here is the new version of my proposal. I have not finished researching all the
remarks Gregor made (like the Lucene query bean), but it would be nice to have
some comments on the changes I have made. There is not much time left, so all
your responses would be appreciated very much!
Regards, Robert
* Google Summer of Code proposal *
Version: Second draft
Date: 8 June 2005
Subject: Apache Lenya search project
Intended audience: Current maintainers and potential mentor(s)
Author: Robert Goene, University of Amsterdam, The Netherlands
= Project description =
The project will consist of a number of subprojects, which can be developed in
fair isolation from each other. This section gives a functional description and
an overview of the techniques used for each individual subproject.
* Integrate the indexing process with the publishing process
* Index the document when published
* Remove the document from the index when deactivated
* Implement standard Lenya document parser (org.apache.lenya.lucene.parser)
* Use index command attributes instead of the ConfigurableIndexer
As a replacement for the ConfigurableIndexer, which creates indexes from a
document based on a collection of XPath statements, I would like to propose an
alternative way of configuring the indexed data. This replacement would consist
of attributes in Lenya's internal XML documents: every XML element that must be
added to the index gets a special attribute, something like
indexField="fieldName".
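To illustrate, a source document marked up this way might look as follows (the
element names and the indexField attribute are only a sketch of the idea, not
an existing Lenya format):

```xml
<page>
  <title indexField="title">Annual report 2004</title>
  <meta>
    <!-- metadata that never reaches the rendered HTML can still be indexed -->
    <creator indexField="creator">R. Goene</creator>
    <dateCreated indexField="created">2005-06-08</dateCreated>
  </meta>
  <body>Elements without the attribute are simply skipped.</body>
</page>
```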
One of the big advantages of this approach is the availability of data that is
not visible to the outside world but could help the search mechanism determine
the most relevant results. Think of metadata that is not completely rendered to
HTML, like the date of creation or the creator.
Besides this, it would be easier to add a new document type to Lenya when the
indexing of the document can be specified in the sample document and the Relax
NG schema.
An implementation of net.nutch.parse would be needed for this port, although I
am not yet sure how to add fields to the index; the Nutch API documentation does
not make this obvious, so I will have to dig into Nutch further.
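Independent of the Nutch question, the attribute-driven extraction itself is
straightforward. A minimal sketch in plain JAXP (the class and method names are
my own; a real implementation would live in org.apache.lenya.lucene.parser):

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class IndexFieldExtractor {

    /** Walks the DOM and collects the text of every element that carries
     *  an indexField attribute, keyed by the field name it declares. */
    public static Map<String, String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, String> fields = new LinkedHashMap<String, String>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void collect(Element element, Map<String, String> fields) {
        String fieldName = element.getAttribute("indexField");
        if (fieldName.length() > 0) {
            fields.put(fieldName, element.getTextContent().trim());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                collect((Element) child, fields);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<page>"
          + "<title indexField=\"title\">Annual report</title>"
          + "<meta><creator indexField=\"creator\">R. Goene</creator></meta>"
          + "<body>Not indexed.</body>"
          + "</page>";
        System.out.println(extract(sample));
        // prints {title=Annual report, creator=R. Goene}
    }
}
```

The resulting map could then be turned into Lucene fields by whichever indexer
(Lenya's own or Nutch's) ends up driving the process.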
I am not sure what the role of Jackrabbit is in a Lenya publication at this
time. Could someone give me a short explanation and some pointers to source
code and web pages?
* Document boost
By adding an extra field called 'Document Boost' to the metadata of the
documents, it becomes possible to use Lucene's boosting feature to control the
relevance of specific documents in the search results. A pull-down menu for
choosing a digit that specifies the boost level should be sufficient.
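A sketch of how this could look in the document metadata (the element name is
hypothetical); at index time the chosen value would be passed to Lucene's
Document.setBoost():

```xml
<meta>
  <!-- 1 = neutral; higher values push the document up in the results -->
  <documentBoost>2</documentBoost>
</meta>
```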
* Extract external links
The publish process should also extract all external links - HTML and PDF -
from the document and add them to the Nutch crawler, so they can be fetched and
indexed in the next Nutch run. In a similar fashion, the external links should
be removed from the Nutch crawl list and the Lucene index when a document is
deactivated.
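As a sketch of the extraction step (the class name and host are hypothetical; a
real implementation would reuse Lenya's own link handling), absolute links
pointing outside the publication's own host could be collected like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExternalLinkExtractor {

    // crude href matcher for absolute http(s) links; relative links are
    // internal by definition and are skipped
    private static final Pattern HREF =
        Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every absolute link that does not point at ownHost. */
    public static List<String> externalLinks(String html, String ownHost) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (!url.contains("://" + ownHost)) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://lenya.example.org/index.html\">home</a>"
                    + "<a href=\"http://other.example.com/paper.pdf\">pdf</a>";
        System.out.println(externalLinks(html, "lenya.example.org"));
        // prints [http://other.example.com/paper.pdf]
    }
}
```

The collected URLs would be handed to Nutch on publish and removed again on
deactivate.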
* Nutch integration for external crawling
It should be possible to add external pages to the Lucene index: for instance,
pages that are part of the website but are not controlled by Lenya, or external
pages that contain related content. Crawling these sites will not be a problem;
a link to an external page on one of the Lenya-controlled pages should be
enough to crawl it and add it to the Lucene index.
* Create usecase for manually calling the index task
* Schedule the Nutch indexing task
* Log the Nutch activities in the Task log
* Create usecase for searching the current publication
The current search pipeline is not part of a specific publication but of the
general Lenya configuration. By making it a usecase, it will be more convenient
to address the search facility from an HTML form, and it will be easier to
adapt the search to the needs of a specific publication.
Solprovider has already implemented a feature like this. In my opinion it looks
pretty good, but it can be revised and simplified with the changes proposed in
this document, especially the replacement of the generator
(http://www.solprovider.com/lenya/search).
* Replace custom Lucene search generator with Cocoon Search generator
The current way of querying the Lucene index is by means of a very long and
unwieldy XSP page. The code is not easy to penetrate when making small changes
to the creation of the result set. There is a clean and easy alternative to
this XSP page and the XSLT sheets that process its result: the Cocoon search
generator
(http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html).
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a specific
publication. Besides this, it seems good practice to take advantage of Cocoon's
facilities as much as possible.
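The publication's sitemap could then declare a search pipeline roughly like the
following (the parameter names and the index path are assumptions to be checked
against the Cocoon documentation):

```xml
<map:match pattern="search">
  <map:generate type="search">
    <!-- query string entered in the HTML search form -->
    <map:parameter name="query" value="{request-param:queryString}"/>
    <!-- location of the publication's Lucene index (hypothetical path) -->
    <map:parameter name="directory" value="work/search/index"/>
  </map:generate>
  <map:transform src="xslt/searchresults2html.xsl"/>
  <map:serialize type="html"/>
</map:match>
```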
* Simplify the current search navigation component
Make the current search form more usable, visually attractive, and easier to
integrate into a publication.
* Related navigation component
Besides the results of an explicit user query, it could be interesting to add a
navigation component that searches the Lucene index for related pages. This
could be done on the subject or description fields of the document. The results
can be integrated in the document as a flexible way of navigating through the
publication.
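For example, for a document whose subject field contains 'taxation', the
component could fire a standard Lucene query such as (the field names are
assumed to match the indexing configuration proposed above):

```text
subject:"taxation" OR description:"taxation"
```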
* Planning *
TBA
* Future considerations *
* Add Lucene index viewer
To have an overview of the created index, it should be fairly simple to
integrate the index viewer Limo (http://limo.sourceforge.net/) into the
administration mode of the Lenya interface. The viewer is an easy tool for
digging into the created index when the search results differ from what you
expected. Such a tool is indispensable when working with the
ConfigurableIndexer, to keep an overview of the created Lucene fields and their
content.
The tool is written as an Apache-licensed Java servlet, and the only
information it needs to function is the path to the Lucene index. The
integration should therefore be fairly easy.
* Jackrabbit and Lucene
The role of Jackrabbit seems to be the more structured queries, as XQuery
provides them. Unstructured full-text searching, which non-experts will use
most of the time, is the domain of the Lucene engine.
When the Lenya API is changed to make use of all the features that Jackrabbit
promises us, the document parser proposed above will have to be moved to
Jackrabbit's Lucene interface; Jackrabbit will then be responsible for a job
that, for the time being, is executed by Lenya.
At this point in time, the Jackrabbit integration is only a future
consideration that should be taken into account when developing new features.
The document parser will be developed with the Jackrabbit API in mind.