Robert Goene wrote:
Hi,
Here is my new proposal. I have done some more research and have
provided a bit more background information and a rudimentary timeline.
I hope you like the looks of it, because the clock is ticking!
please see my comments below; they are mostly comments on
implementation
Regards, Robert
------------------------------------------------------------------------
* Google Summer of Code proposal *
Version: Third draft
Date: 12 June 2005
Subject: Apache's lenya-search project
Intended audience: Current maintainers and potential mentor(s)
Author: Robert Goene, University of Amsterdam, The Netherlands
* Project Overview
The Lenya-Search project is part of the Lenya Content Management System,
hosted by the Apache Software Foundation. Heavily based on the XML publishing
framework Cocoon, Lenya combines an easy interface for the end user with
advanced possibilities for the XML-aware developer. This makes Lenya a good
choice for both straightforward and more complex websites.
The search facilities of Lenya are based on the Apache project Lucene. This
search engine takes care of the indexing of documents and processing of the
queries.
The lenya-search project has as its objective the integration of Lenya
and Lucene. The current integration is not as easy and flexible as it should
be for a complete CMS. The indexing process, for instance, depends on a number
of home-made indexers that take care of adding all documents to Lucene. This
process must be started manually through an ant job. The indexers are not
flexible enough and should be more focused on the documents that Lenya is
dealing with: XHTML documents with Dublin Core metadata. Besides this, it
should be easy to add custom document types to the CMS; Lenya should be able
to handle XML documents of all kinds in a more straightforward way. This
proposal is part of that more general goal.
In other words: the search facilities should be further integrated in Lenya.
The search possibilities are not trivial to use in a Lenya publication, and
they obviously should be.
The development will be based on the current trunk of the project: version 1.4.
This major release contains a large number of architectural changes, so a
change like the one described here is appropriate to add to this new release.
The current stable version (1.2) will only be updated with crucial bugfixes;
no significant new features will be added.
http://lenya.apache.org/1_4/index.html
* Project description
The project will consist of a number of subprojects, which can be
developed fairly independently of each other. This section gives
a functional description and an overview of the techniques used for
each individual subproject.
* Integrate the indexing process with the Lenya publishing usecases
* Index the document when published
When a document is published it should be added to the Lucene index
immediately. This can be accomplished by extending the publish process,
which is implemented as a Lenya 1.4 usecase.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html
* Remove the document from the index when deactivated
Documents that are no longer part of the 'Live' section of the Lenya
publication (the publicly available website) should be removed from the
Lucene index immediately. In a similar fashion to the publishing of a
document, the deactivate usecase of Lenya 1.4 should be extended with the
removal of the document from the Lucene index.
I think this could be done more generally, such that a document is
re-indexed every time it changes, e.g. also after editing; there could
be one index for the authoring area and one for the live area.
Even more general would be to search documents in association with the
workflow, but that would probably be for releases beyond Lenya 1.4. I am
mentioning it to point out where I think it would make sense to head.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
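The symmetry between the two usecases could be sketched like this. Note this is
a schematic stand-in only: the class and method names are assumptions, not the
real Lenya 1.4 usecase API, and a plain map simulates the Lucene index.

```java
import java.util.HashMap;
import java.util.Map;

// Schematic sketch: the point is that indexing is coupled to the publish
// and deactivate usecases instead of a manually started ant job. The map
// below is a stand-in for the Lucene index of the live area.
public class LiveIndexHook {
    private final Map<String, String> liveIndex = new HashMap<>();

    public void publish(String documentId, String content) {
        // ... existing publish work: workflow transition, copy to live area ...
        liveIndex.put(documentId, content); // index immediately on publish
    }

    public void deactivate(String documentId) {
        // ... existing deactivate work: remove from the live area ...
        liveIndex.remove(documentId); // keep the index in sync with 'Live'
    }

    public boolean isIndexed(String documentId) {
        return liveIndex.containsKey(documentId);
    }
}
```

The invariant to test for is simple: a document is findable exactly while it is
live.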
* Document parser
* lenya.index
Lucene comes packaged with standard XML and HTML parsers to add documents to
the index. These parsers fetch the data out of the document and store it in
different fields of the Lucene index. The documents that Lenya works with are
extended XHTML documents that can be parsed with the standard HTML parser,
but that would lack the possibility of indexing the metadata that comes with
these Lenya documents.
As a replacement for the ConfigurableIndexer, which creates indexes from a
document based on a collection of XPath statements, I would like to propose
an alternative way of configuring the indexed data. This replacement would
consist of tags in the internal XML documents of Lenya: every XML element
that must be added to the index needs a special attribute, something like
indexField="fieldName".
I don't think the ConfigurableIndexer should be replaced, though I am not
saying its implementation is great. One wants to keep the definition
centralized as well as attached individually (just as is the case for the
workflow or validation schema of a document). Always the same problem ;-)
IIUC, every document would have to be tagged. What if a field changes?!
I am not saying your suggestion doesn't make sense for certain cases, but
I wouldn't treat it as a replacement, rather as an enhancement.
One of the big advantages of this approach would be the availability of data
that isn't visible to the outside world but could help the search mechanism
determine the most relevant results. One could think of metadata that isn't
completely rendered to HTML, like the date of creation or the creator.
Besides this, it would be easier to add a new document type to Lenya if the
indexing of the document can be specified in the sample document and the
Relax NG schema.
adding the indexing to the schema would probably make life easier, but it
is basically the same as the current solution (one transformation from one
to the other).
Every document in Lenya has an accompanying Relax NG schema that validates
each edited document when it is saved.
I don't think a document should require a schema, but I guess we get
into a religious war here. In any case, you definitely cannot assume that
everything is validated by Relax NG; Lenya would limit itself badly if it
neglected schemas like XSD and others ...
This schema should allow a number of elements in a document to carry the
index information: all elements that are allowed to be added to the Lucene
index must be extended with the lenya.index definition. This may sound like
a lot of work, but it shouldn't be that hard. An XHTML document, for
instance, only needs a few metadata elements and the body element to be
specified. The following Relax NG snippet should be added to all elements
that can be indexed; the text of each lenya.index element contains the name
of the Lucene index field.
<define name="lenya.index">
  <zeroOrMore>
    <element name="lenya.index">
      <text/>
    </element>
  </zeroOrMore>
</define>
Notice the possibility of adding more than one lenya.index field name. This
makes it possible to add the same data to different fields, which can be
useful when the user wants a general or a more specific search: the data is
added to a general field that is also used by other elements, and the
specific field is queried when one knows the exact field that must be
addressed.
The actual XML document must add the lenya.index elements to the elements
that must be indexed. The actual field name is specified in the XML document,
not in the schema. This makes filling the index more flexible, without making
it harder to have a common index field for all documents. Since all documents
are created from a sample XML file, the default index fields can be provided
in this file; individual exceptions are still possible.
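For illustration, a fragment of such a sample file might look like this. The
element and field names here are assumptions, not an existing Lenya format;
the lenya.index convention follows the Relax NG snippet above.

```xml
<!-- Hypothetical sample-file fragment: the lenya.index child elements name
     the Lucene fields the surrounding element's text should be added to. -->
<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">
  <lenya.index>title</lenya.index>
  <lenya.index>contents</lenya.index>
  Title of the sample document
</dc:title>
```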
The LenyaIndex parser, as described above, must be applied to the most used
document type in Lenya: the XHTML document extended with Dublin Core metadata.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html
I don't fully understand your example. Can you make one that shows the
mapping to the Lucene document, with a content example, e.g. a press release:
<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>
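Such a mapping could be sketched like this, in pure JDK code. This is
illustrative only: the Lucene Document is simulated by a plain map of field
names to text, and the <lenya.index> child-element convention follows the
Relax NG snippet above.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of the proposed mapping: each <lenya.index> child names a field
// that its parent element's text is added to.
public class IndexMapper {

    public static Map<String, String> extractFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, String> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void collect(Element e, Map<String, String> fields) {
        List<String> names = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node c = children.item(i);
            if (c instanceof Element && "lenya.index".equals(c.getNodeName())) {
                names.add(c.getTextContent().trim()); // a target field name
            } else if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getTextContent());
            } else if (c instanceof Element) {
                collect((Element) c, fields); // recurse into nested elements
            }
        }
        String value = text.toString().trim();
        for (String name : names) {
            // the same data may go into several fields (general + specific)
            fields.merge(name, value, (a, b) -> (a + " " + b).trim());
        }
    }

    public static void main(String[] args) throws Exception {
        String pr = "<pr>"
                + "<title><lenya.index>title</lenya.index>"
                + "<lenya.index>contents</lenya.index>New release</title>"
                + "<date><lenya.index>date</lenya.index>2005-06-12</date>"
                + "<content><lenya.index>contents</lenya.index>Lenya 1.4 is out.</content>"
                + "</pr>";
        System.out.println(extractFields(pr));
    }
}
```

For the press release above, the sketch yields the fields title, date and
contents, with the title text added to both the specific "title" field and
the general "contents" field.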
* Document boost
By adding an extra field called 'Document Boost' to the metadata of the
documents, it will be possible to use the boosting feature of Lucene to
control the relevance of specific documents in the search results. A pulldown
menu with a selectable digit to specify the boost level should be sufficient.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
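A minimal sketch of the pulldown-to-boost translation: the digit range and the
digit-to-float mapping are assumptions for illustration, not part of the
proposal; only Document.setBoost(float) itself is the documented Lucene API.

```java
// Sketch: translate the 'Document Boost' digit from the proposed pulldown
// menu into the float that Lucene's Document.setBoost(float) expects.
public class BoostMapping {
    public static float toBoost(int level) {
        if (level < 1 || level > 9) {
            throw new IllegalArgumentException("boost level must be between 1 and 9");
        }
        return level / 5.0f; // level 5 keeps the Lucene default boost of 1.0
    }
}
```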
* Extract external links
The publish process should also extract all the external links - HTML and
PDF - from the document and add them to the Nutch crawler, so they can be
fetched and indexed in the next Nutch run. In a similar fashion, the external
links should be removed from the Nutch fetch list and the Lucene index when
a document is deactivated.
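The extraction step could be sketched like this. A real implementation would
walk the document's DOM during the publish usecase; the regex and the host
filter here are illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull external links out of a published document's HTML so they
// can be handed to the Nutch fetch list.
public class ExternalLinkExtractor {
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static List<String> extract(String html, String ownHost) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (!url.contains(ownHost)) { // keep only links leaving the publication
                links.add(url);
            }
        }
        return links;
    }
}
```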
how do you want to treat these external links?
* Nutch integration for external crawling
It should be possible to add external pages to the Lucene index: for
instance, pages that are part of the website but not controlled by Lenya,
or external pages that contain related content. Crawling these sites will
not be a problem; linking to external pages from one of the controlled
pages should be enough to crawl them and add them to the Lucene index.
* Schedule the nutch indexing task
The external pages whose links have been extracted during the indexing of a
document are fetched and indexed by Nutch. These documents can be HTML or
PDF, as Nutch is able to handle both types.
The list of links will be crawled and indexed by Nutch and added to the
Lucene index. This will be a scheduled job that runs from time to time and
can be controlled from the Lenya administrator interface.
http://lucene.apache.org/nutch/apidocs/net/nutch/fetcher/Fetcher.html
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/usecase/scheduling/UsecaseCronJob.html
* Create Usecase for searching the current publication
The current search pipeline is not part of a specific publication, but of
the general Lenya configuration. By making it a usecase, it will be more
convenient to address the search facility from an HTML form, and it will be
easier to adapt the search to the needs of a specific publication. Another
reason to move to usecases is that Lenya 1.4 makes standard use of them.
Solprovider has already implemented a feature like this. In my opinion it
looks pretty good, but it can be revised and simplified with the changes
proposed in this document, especially the replacement of the generator.
http://www.solprovider.com/lenya/search
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/search/usecases/Search.html
* Change the communication of Lenya with Lucene
The communication of Lenya with the Lucene index is pretty nasty at the
moment. The current approach uses a custom XSP page that contains
server-processed Java code communicating with the Lucene API. This code is
neither flexible nor easily extendable.
it is flexible, but I also don't like the XSP, and you are right that it's
horrible to change things
Making small changes to the result set can take a very long time to implement.
Different approaches to changing this are possible: using the Cocoon
LuceneQueryBean, which makes all Lucene search features available to any
Cocoon application, or using a custom navigation component with the standard
Cocoon search generator.
The latter approach seems the most appropriate to me, because of the highly
customizable nature of Lenya, which then only requires knowledge of XSLT.
The LuceneQueryBean offers possibilities for both common and advanced uses,
but seems to lack the customization that a navigation component based on
nothing more than an XSLT sheet can offer.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/index/Index.html
* Replace custom Lucene search generator with Cocoon Search generator
There is a very clean and easy alternative to this nasty XSP page and the
XSLT sheets that process its results: the Cocoon search generator.
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a
specific publication. Besides this, it seems good practice to take
advantage of Cocoon's facilities as much as possible.
http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
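For illustration, a sitemap fragment using the generator might look roughly
like this. The parameter names, paths and XSLT file names here are
assumptions; the authoritative usage is in the Cocoon userdocs linked above.

```xml
<!-- Hypothetical sitemap fragment: generate search results with Cocoon's
     search generator and style them per publication with a plain XSLT sheet. -->
<map:match pattern="search">
  <map:generate type="search">
    <!-- "directory" (the index location) is an assumed parameter name -->
    <map:parameter name="directory" value="work/search/index"/>
  </map:generate>
  <map:transform src="xslt/search2html.xsl"/>
  <map:serialize type="html"/>
</map:match>
```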
how does the XML of the search generator differ from the current Lenya
implementation?
* Simplify the current search navigation component
Make the current search form more usable, visually attractive and easier to
integrate into a publication. Change the current navigation component -
search.xsl - to be compatible with the new interface and change its
appearance.
* Related navigation component
Besides the results of an explicit user query, it could be interesting to
add a navigation component that searches the Lucene index for related pages.
This could be done on the subject or description fields of the document. The
results can be integrated into the document as a flexible way of navigating
through the publication.
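The idea behind such a component could be sketched like this. A real
implementation would query the Lucene index on the subject/description
fields; the in-memory map here is a stand-in for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: find "related" documents by overlapping subject terms.
public class RelatedPages {
    public static List<String> related(String docId,
                                       Map<String, Set<String>> subjects) {
        Set<String> own = subjects.getOrDefault(docId, Set.of());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : subjects.entrySet()) {
            if (!e.getKey().equals(docId)
                    && !Collections.disjoint(own, e.getValue())) {
                result.add(e.getKey()); // shares at least one subject term
            }
        }
        return result;
    }
}
```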
* Planning
14 June 05: Proposal deadline
24 June 05: Acceptance or rejection of proposal
06 July 05: Index when publishing
06 July 05: Remove when deactivated
14 July 05: Document parser
            - index fields
            - boost
            - external links
21 July 05: Nutch integration
28 July 05: Search usecase
28 July 05: Search generator
28 July 05: Search navigation component
28 July 05: Related navigation component
01 Sept 05: Pencils down
* Future consideration
These considerations are not formal requirements of this proposal, but
sidetracks that could play a role in future developments. By writing them
down, they become part of the considerations for the current proposal
without being a direct goal of the project described above.
* Add Lucene indexviewer *
To have an overview of the created index, it should be fairly simple to
integrate the index viewer Limo (http://limo.sourceforge.net/) into the
administration mode of the Lenya interface. The viewer is an easy tool for
digging into the created index when the search results are different than
you expected. This tool is indispensable when working with the
ConfigurableIndexer, to keep an overview of the created Lucene fields and
their content.
The tool is written as an Apache-licensed Java servlet, and the only
information it needs to function is the path to the Lucene index. The
integration should therefore be fairly easy.
yes, this could go into the admin area
* Jackrabbit and Lucene
The role of Jackrabbit seems to apply to the more structured queries that
XQuery makes possible. Unstructured full-text searching, which non-technical
users will use most of the time, is the domain of the Lucene engine.
When the Lenya API is changed to make use of all the features that
Jackrabbit promises us, the document parser proposed above will have to be
moved to the Lucene interface of Jackrabbit; Jackrabbit will then be
responsible for a job that, for the time being, will be executed by Lenya.
At this point in time, the Jackrabbit integration is only a future
consideration and should be taken into account when developing new features.
The document parser will be developed with the Jackrabbit API in mind.
yes, it makes sense to keep an eye on JCR/Jackrabbit
Michi
------------------------------------------------------------------------
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]