Hi All,

Here is the new version of my proposal. I have not yet finished researching all the remarks Gregor made (like the Lucene query bean), but it would be nice to have some comments on the changes I have made. There is not much time left, so all responses would be appreciated very much!

Regards, Robert
* Google Summer of Code proposal *

Version: Second draft
Date: 8 June 2005
Subject: Apache's lenya-search project
Intended audience: Current maintainers and potential mentor(s)
Author: Robert Goene, University of Amsterdam, The Netherlands

= Project description =

The project consists of a number of subprojects, which can be
developed fairly independently of each other. This section gives
a functional description and an overview of the techniques used for
each individual subproject.

 * Integrate the indexing process with the publishing process
 
  * Index the document when published
  
  * Remove the document from the index when deactivated
    
 * Implement a standard Lenya document parser
   (org.apache.lenya.lucene.parser)
 
  * Use index command attributes instead of the ConfigurableIndexer.

    As a replacement for the ConfigurableIndexer, which builds the index
    from a document based on a collection of XPath statements, I would like
    to propose an alternative way of configuring the indexed data. The
    configuration would instead consist of tags in the internal XML
    documents of Lenya: every XML element that must be added to the index
    gets a special attribute, something like indexField="fieldName".

    One of the big advantages of this approach is the availability of data
    that is not visible to the outside world, but could help the search
    mechanism determine the most relevant results. Think of metadata that
    is not completely rendered to HTML, such as the date of creation or
    the creator.

    Besides this, it becomes easier to add a new document type to Lenya
    when the indexing of the document can be specified in the sample
    document and the Relax NG schema.
  
    An implementation of net.nutch.parse would be needed for this porting,
    although from looking at the Nutch API documentation I am not yet sure
    how to add fields to the index. I will have to dig further into Nutch.
    
    I am not sure what the role of Jackrabbit is in the Lenya publication
    at this time. Could someone give me a short explanation and some
    pointers to source code and webpages?
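    To make the proposed mechanism concrete, here is a minimal sketch of
    the attribute-driven extraction using only the standard DOM API. The
    attribute name indexField comes from the proposal above; the class and
    method names are hypothetical, and a real parser would hand the
    resulting (field, text) pairs to Lucene rather than return a map.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: collect (field name, text) pairs from every element that
 *  carries an indexField attribute. Names are illustrative only. */
public class IndexFieldExtractor {

    public static Map<String, String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> fields = new LinkedHashMap<String, String>();
        walk(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void walk(Element e, Map<String, String> fields) {
        String field = e.getAttribute("indexField");
        if (field.length() > 0) {
            // getTextContent() gathers the text of the element
            // and all its descendants
            fields.put(field, e.getTextContent().trim());
        }
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                walk((Element) child, fields);
            }
        }
    }
}
```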
    
  * Document boost
   
    By adding an extra field called 'Document Boost' to the metadata of
    the documents, it becomes possible to use Lucene's boosting feature to
    control the relevance of specific documents in the search results. A
    pulldown menu with a choosable digit to specify the boost level should
    be sufficient.
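    The effect of the boost can be sketched with plain arithmetic: Lucene
    multiplies a document's raw match score by its boost, so a boosted
    document outranks an equally matching unboosted one. A minimal
    illustration (class and method names are hypothetical):

```java
/** Sketch of how a per-document boost reorders hits: the effective
 *  score of a hit is its raw match score multiplied by the boost,
 *  which the proposed 'Document Boost' metadata field would supply. */
public class BoostSketch {

    public static float effectiveScore(float rawScore, float boost) {
        return rawScore * boost;
    }

    /** Returns the index (0 or 1) of the better-ranked of two hits. */
    public static int winner(float raw0, float boost0,
                             float raw1, float boost1) {
        return effectiveScore(raw0, boost0) >= effectiveScore(raw1, boost1)
                ? 0 : 1;
    }
}
```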

  * Extract external links

    The publish process should also extract all external links - HTML and
    PDF - from the document and add them to the Nutch crawler, so they can
    be fetched and indexed in the next Nutch run.

    In a similar fashion, the external links should be removed from Nutch
    and from the Lucene index when a document is deactivated.
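    The extraction step could look roughly like the sketch below, which
    collects href targets pointing outside the publication's own base URL.
    The class name is hypothetical, and a real implementation would work
    on the already-parsed DOM of the published document rather than a
    regular expression over the HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the link-extraction step of the publish process: collect
 *  absolute links that leave the publication, so they can be handed
 *  to the Nutch crawler for the next fetch run. */
public class ExternalLinkExtractor {

    private static final Pattern HREF =
        Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns absolute links that do not start with the site's own base URL. */
    public static List<String> externalLinks(String html, String ownBase) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (!url.startsWith(ownBase)) {
                links.add(url);
            }
        }
        return links;
    }
}
```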

* Nutch integration for external crawling

  It should be possible to add external pages to the Lucene index: for
  instance, pages that are part of the website but not controlled by
  Lenya, or external pages that contain related content. Crawling these
  sites will not be a problem; linking to them from one of the pages
  controlled by Lenya should be enough to crawl them and add them to the
  Lucene index.
  
  * Create usecase for manually calling the index task

  * Schedule the nutch indexing task
  
  * Log the nutch activities in the Task log.
  
* Create Usecase for searching the current publication

  The current search pipeline is not part of a specific publication, but
  part of the general Lenya configuration. By making it a usecase, it
  becomes more convenient to address the search facility from an HTML
  form, and easier to adapt it to the search needs of a specific
  publication.

  Solprovider has already implemented a feature like this
  (http://www.solprovider.com/lenya/search). In my opinion it looks
  pretty good, but it can be revised and simplified with the changes
  proposed in this document, especially the replacement of the generator.
  
* Replace custom Lucene search generator with Cocoon Search generator

  The current way of querying the Lucene index is by means of a very long
  and unwieldy XSP page. The code is hard to penetrate when making small
  changes to the creation of the result set.

  There is a clean and easy alternative to this XSP page and the XSLT
  sheets that process its result: the Cocoon search generator
  (http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html).
  By using this generator instead of the clumsy search pipeline currently
  employed, it will be easier to debug or change the result set for a
  specific publication. Besides this, it seems good practice to take
  advantage of Cocoon's facilities as much as possible.
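  Mounting the generator in a publication's sitemap could look roughly
  like the sketch below. The directory, file, and parameter names are
  assumptions for illustration; the exact configuration should be
  checked against the Cocoon documentation linked above.

```xml
<!-- Sketch only: paths and names are assumptions, not the final config. -->
<map:generators>
  <map:generator name="search"
                 src="org.apache.cocoon.generation.SearchGenerator"/>
</map:generators>

<map:pipeline>
  <map:match pattern="search">
    <!-- the generator queries the Lucene index and emits the result
         set as XML, which a plain XSLT sheet can then render -->
    <map:generate type="search" src="lucene/index"/>
    <map:transform src="xslt/searchresults2html.xsl"/>
    <map:serialize type="html"/>
  </map:match>
</map:pipeline>
```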

* Simplify the current search navigation component

  Make the current search form more usable, visually attractive, and
  easier to integrate into a publication.

* Related navigation component

  Besides the results of an explicit user query, it could be interesting
  to add a navigation component that searches the Lucene index for
  related pages, for example on the subject or description fields of the
  document. The results can be integrated into the document as a flexible
  way of navigating through the publication.

* Planning *

TBA

* Future considerations

 * Add a Lucene index viewer

  To get an overview of the created index, it should be fairly simple to
  integrate the index viewer Limo (http://limo.sourceforge.net/) into the
  administration mode of the Lenya interface. The viewer is an easy tool
  for digging into the created index when the search results differ from
  what you expected. When working with the ConfigurableIndexer, such a
  tool is indispensable for getting an overview of the created Lucene
  fields and their content.

  The tool is written as an Apache-licensed Java servlet, and the only
  information it needs to function is the path to the Lucene index. The
  integration should therefore be fairly easy.

* Jackrabbit and Lucene

  The role of Jackrabbit seems to lie in the more structured queries that
  XQuery provides; unstructured fulltext searching, as non-technical
  users will do most of the time, is the territory of the Lucene engine.

  When the Lenya API is changed to make use of all the features that
  Jackrabbit promises, the document parser proposed above will have to be
  moved to the Lucene interface of Jackrabbit. Jackrabbit will then be
  responsible for a job that, for the time being, is executed by Lenya.

  At this point in time, the Jackrabbit integration is only a future
  consideration, but it should be taken into account when developing new
  features. The document parser will be developed with the Jackrabbit API
  in mind.
