Robert Goene wrote:

Hi,

Here is my new proposal. I have done some more research and have added a bit more background information and a rudimentary timeline.

Hope you like the looks of it, because the clock is ticking!


Please see my comments below; they are mostly comments on the implementation.


Regards, Robert

------------------------------------------------------------------------

* Google Summer of Code proposal *

Version: Third draft version
Date: 12 June 2005
Subject: Apache's lenya-search project
Intended audience: Current maintainers and potential mentor(s)
Author: Robert Goene, University of Amsterdam, The Netherlands

* Project Overview

 The Lenya-Search project is part of the Lenya Content Management System, as
 hosted by the Apache Foundation. Heavily based on the XML publishing framework
 Cocoon, Lenya combines an easy interface for the end user with advanced
 possibilities for the XML-aware developer. This makes Lenya a good choice for
 both straightforward and more complex websites.

 The search facilities of Lenya are based on the Apache project Lucene. This
 search engine takes care of the indexing of documents and the processing of
 queries. The lenya-search project has found its objective in the integration
 of Lenya and Lucene. The current integration is not as easy and flexible as it
 should be for a complete CMS. The indexing process, for instance, depends on a
 number of home-made indexers that take care of adding all documents to Lucene.
 This process must be started manually through an Ant job. The indexers are not
 flexible enough and should be more focused on the documents that Lenya is
 dealing with: XHTML documents with Dublin Core metadata. Besides this, custom
 documents should be easy to add to the CMS. Lenya should be able to handle
 XML documents of all kinds in a more straightforward way. This proposal is
 part of this more general goal.

In other words: the search facilities should be further integrated into Lenya.
The search possibilities are not trivial to use in a Lenya publication, and
they obviously should be.

The development will be based on the current trunk of the project: version 1.4.
This major release contains a large number of architectural changes. A change
like the one described here is appropriate to add to this new release. The
current stable version (1.2) will only be updated with crucial bugfixes; no
significant new features will be added.

http://lenya.apache.org/1_4/index.html

* Project description

The project will consist of a number of subprojects, which can be developed
fairly independently of each other. This section gives a functional description
and an overview of the techniques used for each individual subproject.

* Integrate the indexing process with the Lenya publishing usecases

 * Index the document when published

   When a document is published it should be added to the Lucene index
   immediately. This can be accomplished by extending the publish process,
   which is implemented as a Lenya 1.4 usecase.

   http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html
 * Remove the document from the index when deactivated

   Documents that are no longer part of the 'Live' section of the Lenya
   publication (the publicly available website) should be immediately removed
   from the Lucene index. In a similar fashion to the publishing of a document,
   the deactivate usecase of Lenya 1.4 should be extended with a removal of the
   document from the Lucene index.
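A minimal, self-contained sketch of the intended publish/deactivate hooks. The in-memory `SearchIndex` map stands in for the real Lucene index, and all class and method names here are illustrative, not the actual Lenya 1.4 usecase API:

```java
import java.util.HashMap;
import java.util.Map;

// Schematic stand-in for the Lucene index: maps document id to indexed text.
class SearchIndex {
    private final Map<String, String> fields = new HashMap<>();

    void add(String documentId, String content) { fields.put(documentId, content); }
    void remove(String documentId) { fields.remove(documentId); }
    boolean contains(String documentId) { return fields.containsKey(documentId); }
}

// Hypothetical usecase hooks: index on publish, remove from index on deactivate.
class PublicationUsecases {
    private final SearchIndex index;

    PublicationUsecases(SearchIndex index) { this.index = index; }

    void publish(String documentId, String content) {
        // ... existing publish steps (workflow transition, copy to live area) ...
        index.add(documentId, content);   // proposed extension: index immediately
    }

    void deactivate(String documentId) {
        // ... existing deactivate steps ...
        index.remove(documentId);         // proposed extension: remove immediately
    }
}
```

The point is only that indexing becomes a side effect of the existing usecases instead of a separately triggered Ant job.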

I think this could be done more generally, such that a document is re-indexed
every time it changes, e.g. also after editing, whereas there could be one
index for the authoring area and one index for the live area.

Even more general would be to search the documents in association with the
workflow, but this would probably rather be Lenya 1.>4; I am mentioning it to
point out where I think it would make sense to head.


http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html

* Document parser

   * lenya.index

      Lucene comes packed with a standard XML and HTML parser to add documents
      to the index. This parser fetches the data out of the document and stores
      this data in different fields of the Lucene index. The documents that
      Lenya works with are extended XHTML documents that can be parsed with the
      standard HTML parser, but they would lack the possibility of indexing the
      metadata that comes with these Lenya documents.

      As a replacement for the ConfigurableIndexer, which creates indexes from
      a document based on a collection of XPath statements, I would like to
      propose an alternative way of configuring the indexed data. This
      replacement would consist of tags in the internal XML documents of Lenya.
      Every XML element that must be added to the index needs a special
      attribute, something like indexField="fieldName".

I don't think the ConfigurableIndexer should be replaced, whereas I am not
saying its implementation is great. One wants to keep the definition
centralized yet attachable individually (just as is the case for the workflow
or the validation schema of a document). Always the same problem ;-)

IIUC then every document would have to be tagged. What if a field is changing?!

I am not saying your suggestion doesn't make sense for certain cases, but
I wouldn't treat it as a replacement, but rather as an enhancement.

      One of the big advantages of this approach would be the availability of
      data that isn't visible to the outside world, but could be helpful for
      the search mechanism to determine the most relevant results. One could
      think of the metadata that isn't completely rendered to HTML, like the
      date of creation or the creator.

      Besides this, it would be easier to add a new document type to Lenya
      when the indexing of the document can be specified in the sample document
      and the Relax NG schema.

Adding the indexing to the schema would probably make life easier, but it is
basically the same as the current solution (one transformation from one to the
other).

      Every document in Lenya has an accompanying Relax NG schema that
      validates every edited document when it is saved.


I don't think a document should require a schema, but I guess we get into a
religious war here. In any case you definitely cannot assume that everything
is validated by Relax NG, because Lenya would limit itself badly if it
neglected schemas like XSD and others ...

      This schema should allow a document to have the index attribute assigned
      to a number of elements. These elements should be extended with the
      lenya.index attribute. This must be done for all elements that are
      allowed to be added to the Lucene index. This may sound like a lot of
      work, but it shouldn't be that hard. An XHTML document, for instance,
      only needs several metadata elements and the body element to be
      specified.

      The following Relax NG snippet should be added to all elements that can
      be indexed. The LenyaFieldName (the text content of the lenya.index
      element) will contain the name of the Lucene index field.

      <define name="lenya:index">
       <zeroOrMore>
        <element name="lenya.index">
         <text/>
        </element>
       </zeroOrMore>
      </define>

      Notice the possibility to add more than one LenyaIndexFieldName. This
      makes it possible to add the same data to different fields, which can be
      useful when the user wants a general or a more specific search: the data
      will be added to the more general field that is also used by other
      elements, and the specific field is queried when one is aware of the
      exact field that must be addressed.

      The actual XML document must add the lenya.index elements to the
      elements that must be indexed. The actual field name is specified in the
      XML document and not in the specification. This makes filling the index
      more flexible, without making it harder to have a common index field for
      all documents. Since all documents are created from a sample XML file,
      the default index fields can be provided in this file. This way
      individual exceptions are still possible.

      The LenyaIndex parser, as described above, must be applied to the most
      used document in Lenya: the XHTML document that is extended with Dublin
      Core metadata.

      http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html

I don't fully understand your example. Can you make one which shows the mapping to the Lucene document and a content example, e.g. a press release:

<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>
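For what it is worth, here is a self-contained sketch of how such a parser could map lenya.index annotations to Lucene field/value pairs, using a press-release-like document. The document shape, the class name and the naive text extraction are all illustrative, not a proposed implementation:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LenyaIndexExtractor {

    // Returns (fieldName, value) pairs for every element that carries a
    // lenya.index child element; the child's text is the Lucene field name.
    static List<String[]> extractFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        List<String[]> fields = new ArrayList<>();
        NodeList markers = doc.getElementsByTagName("lenya.index");
        for (int i = 0; i < markers.getLength(); i++) {
            Element marker = (Element) markers.item(i);
            String fieldName = marker.getTextContent().trim();
            Element parent = (Element) marker.getParentNode();
            // Naive: the field value is the parent's text minus the marker's text.
            String value = parent.getTextContent()
                    .replace(marker.getTextContent(), "").trim();
            fields.add(new String[] { fieldName, value });
        }
        return fields;
    }
}
```

Fed `<pr><title>New release<lenya.index>title</lenya.index></title></pr>`, this yields the pair ("title", "New release"), which on the Lucene side would become a field named "title" in the Lucene Document for that page.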

* Document boost

    By adding an extra field to the metadata of the documents, called 'Document
    Boost', it will be possible to use the boosting feature of Lucene to
    control the relevance of specific documents in the search results. A
    pulldown menu with a selectable number to specify the boost level should be
    sufficient.

    http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
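Schematically, Lucene's document boost is a multiplier folded into the relevance score of every hit from that document. A toy illustration of the effect on ranking (the base scores are invented; real Lucene scoring is far more involved):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Hit {
    final String id;
    final float baseScore;  // score from the query match alone
    final float boost;      // editor-chosen document boost

    Hit(String id, float baseScore, float boost) {
        this.id = id; this.baseScore = baseScore; this.boost = boost;
    }

    // The boost multiplies into the final score, as Document.setBoost does.
    float score() { return baseScore * boost; }
}

class Ranking {
    // Sorts hits by boosted score, highest first.
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return sorted;
    }
}
```

So a document with a weaker query match can still outrank a stronger match if its editor gave it a high enough boost.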

  * Extract external links

    The publish process should also extract all the external links - HTML and
    PDF - from the document and add them to the Nutch crawler, so they can be
    fetched and indexed in the next Nutch run.

    In a similar fashion, the external links should be removed from the Nutch
    fetch list and the Lucene index when deactivating a document.

how do you want to treat these external links?

* Nutch integration for external crawling

 It should be possible to add external pages to the Lucene index: for
 instance, pages that are part of the website but are not controlled by Lenya,
 or external pages that contain related content. The crawling of these sites
 will not be a problem. Linking to external pages on one of the pages
 controlled by Lenya should be enough to crawl these pages and add them to the
 Lucene index.
 * Schedule the Nutch indexing task

   The external pages that have been extracted as links during the indexing of
   a document are fetched and indexed by Nutch. These documents can be HTML or
   PDF ones, as Nutch is able to handle these types.

   The list of links will be crawled and indexed by Nutch and added to the
   Lucene index. This process will be a scheduled job that runs from time to
   time, which can be controlled from the Lenya administrator interface.

   http://lucene.apache.org/nutch/apidocs/net/nutch/fetcher/Fetcher.html
   
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/usecase/scheduling/UsecaseCronJob.html
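The scheduling itself is standard. A sketch with a plain `java.util.concurrent` scheduler; the crawl task is a placeholder, and in Lenya 1.4 the job would rather be wrapped in the UsecaseCronJob linked above:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class NutchCrawlScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Runs the given crawl task periodically; the interval would be
    // configurable from the Lenya administrator interface.
    void schedule(Runnable crawlTask, long intervalMinutes) {
        scheduler.scheduleAtFixedRate(crawlTask, 0, intervalMinutes, TimeUnit.MINUTES);
    }

    void shutdown() { scheduler.shutdownNow(); }
}
```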
* Create Usecase for searching the current publication

The current search pipeline is not part of a specific publication, but part of
the general Lenya configuration. By making it a usecase, it will be more
convenient to address the search facility from an HTML form, and it will be
easier to change the search needs of a specific publication. Another reason to
move to usecases is the fact that Lenya 1.4 makes standard use of them.

Solprovider has already implemented a feature like this. In my opinion it
looks pretty good, but it can be revised and simplified with the changes
proposed in this document, especially the replacement of the generator.

http://www.solprovider.com/lenya/search
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/search/usecases/Search.html
* Change the communication of Lenya with Lucene

 The communication of Lenya with the Lucene index is pretty nasty at the
 moment. The current approach is the use of a custom XSP page that contains
 server-processed Java code communicating with the Lucene API. This code is
 programmed in a way that is neither flexible nor extensible.

it is flexible, but I also don't like the XSP, and you are right: it's
horrible to change things

 Making small changes to the result set can take a very long time to implement.

Different approaches to change this are possible: using the Cocoon
LuceneQueryBean, which makes all Lucene search features available to any
Cocoon application, or using a custom navigational component and the standard
Cocoon search generator. The latter approach seems the most appropriate to me,
because of the highly customizable nature of Lenya, which only needs knowledge
of XSLT. The LuceneQueryBean offers possibilities for both common and advanced
uses, but seems to lack the customization that only a navigation component
based on an XSLT sheet can offer.
 http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/index/Index.html
* Replace custom Lucene search generator with Cocoon Search generator

  There is a very clean and easy alternative to this nasty XSP page and the
  XSLT sheets that process its result: the Cocoon search generator. By using
  this generator instead of the clumsy search pipeline currently employed, it
  will be easier to debug or change the result set for a specific publication.
  Besides this, it seems to me good practice to take advantage of Cocoon's
  facilities as much as possible.

  http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html

how does the XML of the search generator differ from the current Lenya
implementation?

 * Simplify the current search navigation component

  Make the current search form more usable, visually attractive and easier to
  integrate into a publication. Change the current navigation component -
  search.xsl - to be compatible with the new interface and change its
  appearance.
* Related navigation component

 Besides the results of an explicit query by the user, it could be interesting
 to add a navigation component that searches the Lucene index for related
 pages. This could be done on the subject or the description fields of the
 document. The results can be integrated in the document as a flexible way of
 navigating through the publication.


* Planning

14 June 05:             Proposal deadline
24 June 05:             Acceptance or rejection of proposal
06 July 05:             Index when publishing
06 July 05:             Remove when deactivated
14 July 05:             Document parser
                         index fields
                         boost
                         external links
21 July 05:             Nutch integration
28 July 05:             Search usecase
28 July 05:             SearchGenerator
28 July 05:             Search navigation component
28 July 05:             Related navigation component
01 Sept 05:             Pencils down

* Future consideration

These considerations are not formal requirements of this proposal, but
sidetracks that could play a role in future developments. By writing them
down, they become part of the considerations for the current proposal without
being a direct goal of the project itself.

* Add Lucene indexviewer *

 To have an overview of the created index, it should be fairly simple to
 integrate the index viewer Limo (http://limo.sourceforge.net/) into the
 administration mode of the Lenya interface. The viewer is an easy tool to dig
 into the created index when the search results are different than you
 expected. This tool is indispensable when working with the
 ConfigurableIndexer to have an overview of the created Lucene fields and
 their content.

 The tool is written as an Apache-licensed Java servlet, and the only
 information it needs to function is the path to the Lucene index. The
 integration should therefore be fairly easy.

yes, this could go into the admin area

* Jackrabbit and Lucene

 The role of Jackrabbit seems to apply to more structured queries as XQuery 
makes possible. The unstructured
fulltext searching, as non-computers will use most of the time, is the area of the Lucene engine.
 When the Lenya API will be changed to make use of all the features that 
Jackrabbit promisses us, the document
parser as proposed above will have to be moved to the Lucene interface of Jackrabbit. Jackrabbit will be responsible for a job that, for the time being, will be executed by Lenya.

 At this point of time, the Jackrabbit integration is only a future 
consideration and should be given account
 for when developing new features. The document parser will be developed with 
the Jackrabbit API in mind.


yes, it makes sense to keep an eye on JCR/Jackrabbit

Michi

------------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]

