Hi,
Thanks for actually reading it and giving a thorough reply!
* Integrate the indexing process with the Lenya publishing usecases
* Index the document when published
When a document is published, it should be added to the Lucene index
immediately. This can be accomplished by extending the publish process,
which is implemented as a Lenya 1.4 usecase.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html
* Remove the document from the index when deactivated
Documents that are no longer a part of the 'Live' section of the
Lenya publication
(the public available website) should be immediately removed from
the Lucene index.
In a similar fashion as the publishing of a document, the
deactivate usecase of
Lenya 1.4 should be extended with a removal of the document of the
Lucene index.
I think this could be done more generally, such that a document is
indexed every time it changes, e.g. also after editing; there could be
one index for the authoring area and one for the live area.
If the document changes, it will be reindexed. I don't really see the
need for a separate index for every area.
Even more general would be to search the documents in association with
the workflow, but this would probably rather be material for Lenya > 1.4.
I am mentioning it to point out where I think it would make sense to head.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
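To make the two hooks above concrete, here is a minimal sketch. None of this is Lenya code: the class and method names (LiveIndex, onPublish, onDeactivate) are made up, and a HashMap stands in for the real Lucene index, which would use IndexWriter.addDocument() on publish and delete by a Term on a URL field on deactivate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a HashMap stands in for the Lucene index so the
// publish/deactivate lifecycle can be shown without the Lucene jars.
public class LiveIndex {

    // document URL -> extracted full text
    private final Map<String, String> index = new HashMap<String, String>();

    // Would be called at the end of the extended Publish usecase.
    public void onPublish(String documentUrl, String extractedText) {
        // A real implementation would call IndexWriter.addDocument(...)
        index.put(documentUrl, extractedText);
    }

    // Would be called at the end of the extended Deactivate usecase.
    public void onDeactivate(String documentUrl) {
        // A real implementation would delete by a Term on the URL field.
        index.remove(documentUrl);
    }

    public boolean isIndexed(String documentUrl) {
        return index.containsKey(documentUrl);
    }
}
```

The point of the sketch is only that both usecases end with one symmetric call into the same index component.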
* Document parser: lenya.index
Lucene comes packed with a standard XML and HTML parser to add
documents to the index. This parser fetches the data out of the
document and stores it in different fields of the Lucene index. The
documents that Lenya works with are extended XHTML documents that can
be parsed with the standard HTML parser, but that would lose the
possibility of indexing the metadata that comes with these Lenya
documents.
As a replacement for the ConfigurableIndexer, which creates indexes
from a document based on a collection of XPath statements, I would like
to propose an alternative way of configuring the indexed data. This
replacement would consist of tags in the internal XML documents of
Lenya. Every XML element that must be added to the index needs a
special attribute, something like indexField="fieldName".
I don't think the ConfigurableIndexer should be replaced, although I am
not saying its implementation is great. One wants to keep the
definition centralized and attached individually (just as is the case
for the workflow or validation schema of a document). Always the same
problem ;-)
IIUC, then every document would have to be tagged. What if a field
changes?!
I am not saying your suggestion doesn't make sense for certain cases,
but I wouldn't treat it as a replacement, rather as an enhancement.
I would agree with the term enhancement.
One of the big advantages of this approach would be the availability of
data that isn't visible to the outside world, but could help the search
mechanism determine the most relevant results. One could think of the
metadata that isn't completely rendered to HTML, like the date of
creation or the creator.
Besides this, it would be easier to add a new document type to Lenya
when the indexing of the document can be specified in the sample
document and the Relax NG schema.
Adding the indexing to the schema would probably make life easier, but
it is basically the same as the current solution (one transformation
from one to the other).
Every document in Lenya has an accompanying RelaxNG
schema that validates every edited document when
it is saved.
I don't think a document should require a schema, but I guess we get
into a religious war here. Still, you can definitely not assume that
everything is validated by RelaxNG, because Lenya would limit itself
badly if it neglected schemas like XSD and others ...
On the one hand you like the centralized definition of the index, as
you propose to add the indexing to the schema; on the other hand, you
want to keep the schema requirement as flexible as possible. I see the
dilemma, and that's why I think my idea is a nice way to keep some
flexibility on the schema side, but with a centralized definition in
the form of the sample file.
Changing the fields would require a change to the 'obsolete' XML
documents, but I think this is a rare case that should actually be
avoided. Fields can be added or become obsolete without a problem, but
renaming a field is something that is done rarely, if ever. Could you
give me a scenario where this would be an urgent problem?
This schema should allow a document to have the index marker assigned
to a number of elements. These elements should be extended with the
lenya.index pattern, and this must be done for all elements that are
allowed to be added to the Lucene index. This may sound like a lot of
work, but it shouldn't be that hard. An XHTML document, for instance,
only needs several metadata elements and the body element to be
specified.
The following Relax NG snippet should be added to all elements that can
be indexed. The text content of the lenya:index element will contain
the name of the Lucene index field.
<define name="lenya.index">
  <zeroOrMore>
    <element name="lenya:index">
      <text/>
    </element>
  </zeroOrMore>
</define>
Notice the possibility to add more than one lenya:index element. This
makes it possible to add the same data to different fields, which can
be useful when the user wants a general or a more specific search: the
data is added to a general field that is also fed by other elements,
and the specific field is queried when one knows the exact field that
must be addressed.
The actual XML document must add the lenya:index elements to the
elements that must be indexed. The actual field name is specified in
the XML document, not in the specification. This makes filling the
index more flexible, without making it harder to have a common index
field for all documents. Since all documents are created from a sample
XML file, the default index fields can be provided in this file, while
individual exceptions are still possible.
The lenya.index parser, as described above, must be applied to the most
used document type in Lenya: the XHTML document that is extended with
Dublin Core metadata.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html
I don't fully understand your example. Can you make one which shows the
mapping to the Lucene document and a content example, e.g. a press release:
<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>
Well, this is just a first shot. I will probably change it, but
something like this:
<pr>
  <title>
    <lenya:index>title</lenya:index>
    Lenya 1.4 release preponed
  </title>
  <content>
    <lenya:index>contents</lenya:index>
    The release of Lenya 1.4, the Apache Content Management System, has
    been preponed.
  </content>
</pr>
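A rough sketch of the parsing side of such a mapping, using only the JDK's DOM classes (the class and helper names, LenyaIndexExtractor and extractFields, are made up for illustration): every element that carries one or more <lenya:index> marker children contributes its remaining text to each named Lucene field.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the proposed lenya.index parser: it collects
// fieldName -> text pairs that would then become Lucene Fields.
public class LenyaIndexExtractor {

    public static Map<String, String> extractFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, String> fields = new HashMap<String, String>();
        walk(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void walk(Element element, Map<String, String> fields) {
        List<String> fieldNames = new ArrayList<String>();
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) child;
                if ("lenya:index".equals(e.getNodeName())) {
                    // Marker element: its text names a Lucene index field.
                    fieldNames.add(e.getTextContent().trim());
                } else {
                    walk(e, fields);
                }
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        // More than one marker feeds the same content into several
        // fields (the general + specific case described above).
        for (String name : fieldNames) {
            fields.put(name, text.toString().trim());
        }
    }
}
```

Run against the press-release example above, this would yield a "title" field and a "contents" field ready to be put into a Lucene Document.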
* Document boost
By adding an extra field called 'Document Boost' to the metadata of
the documents, it will be possible to use the boosting feature of
Lucene to control the relevance of specific documents in the search
results. A pulldown menu with a choosable digit to specify the boost
level should be sufficient.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
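Conceptually, Document.setBoost(float) (linked above) scales a document's score at search time. The effect can be illustrated with plain arithmetic; the class name and numbers here are invented for illustration:

```java
// Toy illustration of document boosting: a raw query score scaled by a
// per-document boost factor, mirroring what Lucene's
// Document.setBoost(float) does conceptually.
public class BoostDemo {

    public static float boostedScore(float rawScore, float boost) {
        return rawScore * boost;
    }
}
```

With two documents of equal raw score, an editor-chosen boost of 2.0 from the proposed pulldown would rank that document above one left at the default 1.0.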
* Extract external links
The publish process should also extract all the external links (HTML
and PDF) from the document and add them to the Nutch crawler, so they
can be fetched and indexed in the next Nutch run. In a similar fashion,
the external links should be removed from the Nutch fetch list and the
Lucene index when a document is deactivated.
How do you want to treat these external links?
I want to collect the links in the document parser and let Nutch fetch
them when the scheduled index process runs. I am not sure yet if I can
feed them to Nutch directly or if I should add them to a text file that
Nutch uses. I will give it another look.
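Collecting the links could be a small extra step in the document parser. A rough sketch (the class name is made up, the regex is a simplification, and treating "starts with http" as the externality test ignores links to the publication's own host, which a real implementation would have to filter out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull the external HTML/PDF links out of a
// document so they can be handed to Nutch for the next crawl.
public class ExternalLinkExtractor {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static List<String> extract(String xhtml) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(xhtml);
        while (m.find()) {
            String url = m.group(1);
            // Simplification: absolute http(s) links count as external.
            boolean external = url.startsWith("http://")
                    || url.startsWith("https://");
            // Only the fetchable types mentioned in the proposal.
            boolean fetchable = url.endsWith(".html") || url.endsWith(".pdf");
            if (external && fetchable) {
                links.add(url);
            }
        }
        return links;
    }
}
```

The same list, computed at deactivation time, would identify the entries to drop from the Nutch fetch list.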
* Replace custom Lucene search generator with Cocoon Search generator
There is a very clean and easy alternative to this nasty XSP page and
the XSLT sheets that process its result: the Cocoon search generator.
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a
specific publication. Besides this, it seems good practice to take
advantage of Cocoon's facilities as much as possible.
http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
How does the XML of the search generator differ from the current Lenya
implementation?
As far as I can see, it contains all the output one can ask for from a
Lucene query. The nice thing is: it is possible to spread the result
over different pages, and the links to all pages are delivered with the
output. It looks pretty comprehensive to me.
Again, thanks for the reply!
Regards, Robert