Robert Goene wrote:
* Replace custom Lucene search generator with Cocoon Search generator *
The current way of querying the Lucene index is by means of a very
long and hard-to-read XSP page. The code is not easy to penetrate when
making small changes to the creation of the result set.
There is a very clean and easy alternative to this XSP page and the
XSLT sheets that process its result: the Cocoon search generator
(http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html)
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a
specific publication. Besides this, it seems good practice to me to take
advantage of Cocoon's facilities as much as possible.
the querybean might be even better suited:
http://svn.apache.org/viewcvs.cgi/cocoon/branches/BRANCH_2_1_X/src/blocks/querybean/java/org/apache/cocoon/bean/query/
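To make the idea concrete, a replacement pipeline could look roughly like
the sketch below. This is only an illustration: the generator class name is
the one documented for Cocoon 2.1, but the parameter names and the index
path are placeholders that would have to be checked against the Search
generator documentation and the publication's configuration.

```xml
<!-- Hypothetical sitemap sketch; parameter names and paths are
     placeholders, to be verified against the Cocoon 2.1 docs. -->
<map:generators>
  <map:generator name="search"
                 src="org.apache.cocoon.generation.SearchGenerator"/>
</map:generators>

<map:pipelines>
  <map:pipeline>
    <map:match pattern="search">
      <!-- the query itself arrives as a request parameter -->
      <map:generate type="search">
        <map:parameter name="directory" value="path/to/lucene/index"/>
      </map:generate>
      <!-- one small XSLT replaces the current XSP + stylesheet chain -->
      <map:transform src="stylesheets/search2html.xsl"/>
      <map:serialize type="html"/>
    </map:match>
  </map:pipeline>
</map:pipelines>
```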
* Nutch integration *
* Create Ant task/usecase for manually calling the index task
this project is for lenya trunk, so it would use the new usecase
framework, not ant tasks
* Create a schedule possibility for indexing
* Incremental indexing based on a changed sitetree or an extra step in the
publish job.
this can happen as part of document.save() (again on trunk)
* Log the nutch activities in the Task log.
this project is for lenya trunk, so it would use the new usecase framework
* Implement standard lenya document parser (net.nutch.parse)
* Document boost
By adding an extra field called 'Document Boost' to the metadata of the
documents, it will be possible to use the boosting feature of Lucene to
control the relevance of specific documents in the search results. A
pulldown menu with a selectable value to specify the boost level should be
sufficient.
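A boost is essentially a multiplier on a document's relevance score; in
Lucene it would be set on the Document at index time from the metadata
field. The plain-Java sketch below (no Lucene dependency; all class and
field names are invented for illustration) shows the intended effect: a
boosted document outranks an otherwise equally relevant one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: shows how a per-document boost re-orders results.
// In Lucene itself the boost would be applied by the indexer.
public class BoostDemo {

    static class Hit {
        final String id;
        final float rawScore;  // score from the query match
        final float boost;     // editor-chosen 'Document Boost' value

        Hit(String id, float rawScore, float boost) {
            this.id = id;
            this.rawScore = rawScore;
            this.boost = boost;
        }

        float score() {
            return rawScore * boost;  // boost multiplies the raw score
        }
    }

    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("index.html", 0.5f, 1.0f),
            new Hit("pressrelease.html", 0.5f, 2.0f));  // boosted by editor
        // The boosted document wins despite the identical raw score.
        System.out.println(rank(hits));
    }
}
```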
* Port ConfigurableIndexer
The ConfigurableIndexer is a very convenient tool for indexing custom XML
documents, or for having more control over the indexing of plain XHTML
documents. It should be ported to Nutch, because that crawler does not offer
a comparable feature. An implementation of net.nutch.parse would be needed
for this port, although I am not yet sure how to add fields to the index,
judging by the API documentation of Nutch. I will have to dig further into
Nutch and the ConfigurableIndexer.
While doing so, the ConfigurableIndexer should also be considered as a tool
to specify and index alternative data sources, like a metadata repository
that is accessible through XPath.
Advice on this issue will be appreciated!
* Add Lucene indexviewer *
To get an overview of the created index, it should be fairly simple to
integrate the index viewer Limo (http://limo.sourceforge.net/) into the
administration mode of the Lenya interface. The viewer is an easy tool for
digging into the created index when the search results are different from
what you expected. It is indispensable when working with the
ConfigurableIndexer, to keep an overview of the created Lucene fields and
their content.
The tool is written as an Apache-licensed Java servlet, and the only
information it needs to function is the path to the Lucene index. The
integration should therefore be fairly easy.
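Since Limo is a servlet that only needs the index path, the integration
could amount to a web.xml fragment along these lines. Note this is a
hypothetical sketch: the servlet class name, init-parameter name, and URL
pattern are placeholders and must be taken from the Limo distribution.

```xml
<!-- Hypothetical web.xml fragment; class and parameter names are
     placeholders to be checked against the Limo distribution. -->
<servlet>
  <servlet-name>limo</servlet-name>
  <servlet-class>net.sf.limo.LimoServlet</servlet-class>
  <init-param>
    <!-- the only configuration Limo needs: where the Lucene index lives -->
    <param-name>index-path</param-name>
    <param-value>/path/to/lenya/lucene/index</param-value>
  </init-param>
</servlet>
<servlet-mapping>
  <servlet-name>limo</servlet-name>
  <url-pattern>/admin/limo/*</url-pattern>
</servlet-mapping>
```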
* Create Usecase for searching the current publication *
The current search pipeline is not part of a specific publication, but part
of the general Lenya configuration. By making it a usecase, it will be more
convenient to address the search facility from an HTML form, and it will be
easier to change the search needs of a specific publication.
Solprovider has already implemented a feature like this
(http://www.solprovider.com/lenya/search). In my opinion it looks pretty
good, but it can be revised and simplified with the changes proposed in this
document, especially the replacement of the generator.
this project is for lenya trunk, so it would use the new usecase framework
* Simplify the current search navigation component
Make the current search form more usable, visually attractive and easier to
integrate into a publication.
this can be accomplished easily with the query bean
* Related navigation component
Besides the results of an explicit user query, it could be interesting to
add a navigation component that searches the Lucene index for related pages.
This could be done on the subject or description fields of the document. The
results can be integrated into the document as a flexible way of navigating
through the publication.
+1
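The essence of such a related-pages component is ranking other documents by
how much their subject/description metadata overlaps with the current page.
The toy sketch below (plain Java; all class and method names are invented,
and a real implementation would instead run a Lucene query over those
fields) illustrates the ranking idea:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch of a "related pages" ranking: candidate documents sharing
// more subject/description terms with the current page rank higher.
public class RelatedPages {

    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    /** Rank candidate pages by the number of subject terms they share. */
    static List<String> related(String subject, Map<String, String> candidates) {
        Set<String> current = terms(subject);
        List<Map.Entry<String, Long>> scored = new ArrayList<>();
        for (Map.Entry<String, String> e : candidates.entrySet()) {
            long overlap = terms(e.getValue()).stream()
                                              .filter(current::contains)
                                              .count();
            if (overlap > 0) {                 // drop unrelated pages
                scored.add(Map.entry(e.getKey(), overlap));
            }
        }
        scored.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Long> e : scored) {
            ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("cms.html", "content management publishing workflow");
        pages.put("cooking.html", "recipes food");
        System.out.println(related("publishing workflow tools", pages));
    }
}
```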
* Metadata crawling *
I am not sure what the role of Jackrabbit is in a Lenya publication at this
time. Could someone give me a short explanation and some pointers to source
code and web pages?
* Planning *
TBA
** Future considerations **
A complementary section with some rough ideas on possible future extensions
of the search facilities.
* Internal crawling *
Instead of crawling the formatted HTML output, we could consider crawling
the XML documents that Lenya uses to render the HTML. One advantage would be
the availability of data that isn't visible to the outside world, but could
help the search mechanism determine the most relevant results. One could
think of metadata that isn't completely rendered to HTML, like the date of
creation or the creator.
crawling should not be necessary if search is implemented as part of the
API.
* External crawling *
It could be interesting to add external pages to the Lucene index: for
instance, pages that are part of the website but not controlled by Lenya, or
external pages that contain related content. The crawling of these sites
will not be a problem; the method of defining these pages is less trivial.
I have considered using the XBEL standard
(http://pyxml.sourceforge.net/topics/xbel/) for storing the links, and an
external tool like the Firefox Bookmark Synchronizer plugin for editing the
list of external pages.
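An XBEL file for the external pages could be as small as the sketch below
(URLs and titles are invented for illustration; the DOCTYPE is the one
published with XBEL 1.0 and should be double-checked against the spec):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xbel PUBLIC
  "+//IDN python.org//DTD XML Bookmark Exchange Language 1.0//EN//XML"
  "http://www.python.org/topics/xml/dtds/xbel-1.0.dtd">
<xbel version="1.0">
  <title>External pages to index</title>
  <bookmark href="http://www.example.org/partner/whitepaper.html">
    <title>Partner whitepaper</title>
    <desc>Related content hosted outside Lenya</desc>
  </bookmark>
</xbel>
```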
this is already part of the homegrown parser, and would have to be
supported through nutch.
one thing that i suggest you add to the proposal is how to do
* searching of metadata fields
* implications / integration with jackrabbit