Robert Goene wrote:
* Replace custom Lucene search generator with Cocoon Search generator *
The current way of querying the Lucene index is by means of a very
long and hard-to-read XSP page. The code is not easy to penetrate when
making small changes to the creation of the result set.
There is a very clean and easy alternative to this XSP page and the
XSLT sheets that process its result: the Cocoon search generator
(http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html)
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a
specific publication. Besides this, it seems good practice to me to take
advantage of Cocoon's facilities as much as possible.
the querybean might be even better suited:
http://svn.apache.org/viewcvs.cgi/cocoon/branches/BRANCH_2_1_X/src/blocks/querybean/java/org/apache/cocoon/bean/query/
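To make the idea concrete, a replacement pipeline could look roughly like
the sketch below. This is only an illustration: the generator class name is
the one documented for Cocoon 2.1, but the parameter names and the index
path are placeholders that would have to be checked against the Search
generator documentation and the publication's configuration.

```xml
<!-- Hypothetical sitemap sketch; parameter names and paths are
     placeholders, to be verified against the Cocoon 2.1 docs. -->
<map:generators>
  <map:generator name="search"
                 src="org.apache.cocoon.generation.SearchGenerator"/>
</map:generators>

<map:pipelines>
  <map:pipeline>
    <map:match pattern="search">
      <!-- the query itself arrives as a request parameter -->
      <map:generate type="search">
        <map:parameter name="directory" value="path/to/lucene/index"/>
      </map:generate>
      <!-- one small XSLT replaces the current XSP + stylesheet chain -->
      <map:transform src="stylesheets/search2html.xsl"/>
      <map:serialize type="html"/>
    </map:match>
  </map:pipeline>
</map:pipelines>
```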
* Nutch integration *
* Create Ant task/usecase for manually calling the index task
this project is for lenya trunk, so it would use the new usecase
framework, not ant tasks
* Create a schedule possibility for indexing
* Incremental indexing based on a changed sitetree or an extra step in the
publish job.
this can happen as part of document.save() (again on trunk)
* Log the nutch activities in the Task log.
this project is for lenya trunk, so it would use the new usecase framework
* Implement standard lenya document parser (net.nutch.parse)
* Document boost
By adding an extra field called 'Document Boost' to the metadata of the
documents, it will be possible to use the boosting feature of Lucene to
control the relevance of specific documents in the search results. A
pulldown menu with a selectable value to specify the boost level should be
sufficient.
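A boost is essentially a multiplier on a document's relevance score; in
Lucene it would be set on the Document at index time from the metadata
field. The plain-Java sketch below (no Lucene dependency; all class and
field names are invented for illustration) shows the intended effect: a
boosted document outranks an otherwise equally relevant one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: shows how a per-document boost re-orders results.
// In Lucene itself the boost would be applied by the indexer.
public class BoostDemo {

    static class Hit {
        final String id;
        final float rawScore;  // score from the query match
        final float boost;     // editor-chosen 'Document Boost' value

        Hit(String id, float rawScore, float boost) {
            this.id = id;
            this.rawScore = rawScore;
            this.boost = boost;
        }

        float score() {
            return rawScore * boost;  // boost multiplies the raw score
        }
    }

    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("index.html", 0.5f, 1.0f),
            new Hit("pressrelease.html", 0.5f, 2.0f));  // boosted by editor
        // The boosted document wins despite the identical raw score.
        System.out.println(rank(hits));
    }
}
```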
* Port ConfigurableIndexer
The ConfigurableIndexer is a very convenient tool for indexing custom XML
documents, or for having more control over the indexing of plain XHTML
documents. It should be ported to Nutch, because that crawler does not offer
a comparable feature. An implementation of net.nutch.parse would be needed
for this port, although I am not yet sure how to add fields to the index,
judging by the API documentation of Nutch. I will have to dig further into
Nutch and the ConfigurableIndexer.
While doing so, the ConfigurableIndexer should also be considered as a tool
to specify and index alternative data sources, like a metadata repository
that is accessible through XPath.
Advice on this issue will be appreciated!
* Add Lucene indexviewer *
To get an overview of the created index, it should be fairly simple to
integrate the index viewer Limo (http://limo.sourceforge.net/) into the
administration mode of the Lenya interface. The viewer is an easy tool for
digging into the created index when the search results are different from
what you expected. It is indispensable when working with the
ConfigurableIndexer, to keep an overview of the created Lucene fields and
their content.
The tool is written as an Apache-licensed Java servlet, and the only
information it needs to function is the path to the Lucene index. The
integration should therefore be fairly easy.
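Since Limo is a servlet that only needs the index path, the integration
could amount to a web.xml fragment along these lines. Note this is a
hypothetical sketch: the servlet class name, init-parameter name, and URL
pattern are placeholders and must be taken from the Limo distribution.

```xml
<!-- Hypothetical web.xml fragment; class and parameter names are
     placeholders to be checked against the Limo distribution. -->
<servlet>
  <servlet-name>limo</servlet-name>
  <servlet-class>net.sf.limo.LimoServlet</servlet-class>
  <init-param>
    <!-- the only configuration Limo needs: where the Lucene index lives -->
    <param-name>index-path</param-name>
    <param-value>/path/to/lenya/lucene/index</param-value>
  </init-param>
</servlet>
<servlet-mapping>
  <servlet-name>limo</servlet-name>
  <url-pattern>/admin/limo/*</url-pattern>
</servlet-mapping>
```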
* Create Usecase for searching the current publication *
The current search pipeline is not part of a specific publication, but part
of the general Lenya configuration. By making it a usecase, it will be more
convenient to address the search facility from an HTML form, and it will be
easier to change the search needs of a specific publication.
Solprovider has already implemented a feature like this
(http://www.solprovider.com/lenya/search). In my opinion it looks pretty
good, but it can be revised and simplified with the changes proposed in this
document, especially the replacement of the generator.
this project is for lenya trunk, so it would use the new usecase framework
* Simplify the current search navigation component
Make the current search form more usable, visually attractive and easier to
integrate into a publication.
this can be accomplished easily with the query bean
* Related navigation component
Besides the results of an explicit user query, it could be interesting to
add a navigation component that searches the Lucene index for related pages.
This could be done on the subject or description fields of the document. The
results can be integrated into the document as a flexible way of navigating
through the publication.
+1
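The essence of such a related-pages component is ranking other documents by
how much their subject/description metadata overlaps with the current page.
The toy sketch below (plain Java; all class and method names are invented,
and a real implementation would instead run a Lucene query over those
fields) illustrates the ranking idea:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch of a "related pages" ranking: candidate documents sharing
// more subject/description terms with the current page rank higher.
public class RelatedPages {

    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    /** Rank candidate pages by the number of subject terms they share. */
    static List<String> related(String subject, Map<String, String> candidates) {
        Set<String> current = terms(subject);
        List<Map.Entry<String, Long>> scored = new ArrayList<>();
        for (Map.Entry<String, String> e : candidates.entrySet()) {
            long overlap = terms(e.getValue()).stream()
                                              .filter(current::contains)
                                              .count();
            if (overlap > 0) {                 // drop unrelated pages
                scored.add(Map.entry(e.getKey(), overlap));
            }
        }
        scored.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Long> e : scored) {
            ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("cms.html", "content management publishing workflow");
        pages.put("cooking.html", "recipes food");
        System.out.println(related("publishing workflow tools", pages));
    }
}
```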
* Metadata crawling *
I am not sure what the role of Jackrabbit is in a Lenya publication at this
time. Could someone give me a short explanation and some pointers to source
code and web pages?
* Planning *
TBA
** Future considerations **
A complementary section with some rough ideas on possible future extensions
of the search facilities.
* Internal crawling *
Instead of crawling the formatted HTML output, we could consider crawling
the XML documents that Lenya uses to render the HTML. One advantage would be
the availability of data that isn't visible to the outside world, but could
help the search mechanism determine the most relevant results. One could
think of metadata that isn't completely rendered to HTML, like the date of
creation or the creator.
crawling should not be necessary if search is implemented as part of the
API.
* External crawling *
It could be interesting to add external pages to the Lucene index: for
instance, pages that are part of the website but not controlled by Lenya, or
external pages that contain related content. The crawling of these sites
will not be a problem; the method of defining these pages is less trivial.
I have considered using the XBEL standard
(http://pyxml.sourceforge.net/topics/xbel/) for storing the links, and an
external tool like the Firefox Bookmark Synchronizer plugin for editing the
list of external pages.
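An XBEL file for the external pages could be as small as the sketch below
(URLs and titles are invented for illustration; the DOCTYPE is the one
published with XBEL 1.0 and should be double-checked against the spec):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xbel PUBLIC
  "+//IDN python.org//DTD XML Bookmark Exchange Language 1.0//EN//XML"
  "http://www.python.org/topics/xml/dtds/xbel-1.0.dtd">
<xbel version="1.0">
  <title>External pages to index</title>
  <bookmark href="http://www.example.org/partner/whitepaper.html">
    <title>Partner whitepaper</title>
    <desc>Related content hosted outside Lenya</desc>
  </bookmark>
</xbel>
```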
this is already part of the homegrown parser, and would have to be
supported through nutch.
one thing that i suggest you add to the proposal is how to do
* searching of metadata fields
* implications / integration with jackrabbit