Hi,

Thanks for actually reading it and giving a thorough reply!


* Integrate the indexing process with the Lenya publishing usecases

 * Index the document when published

When a document is published, it should be added to the Lucene index immediately. This can be accomplished by extending the publish process, which is implemented as a Lenya 1.4 usecase.

http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html

 * Remove the document from the index when deactivated

Documents that are no longer part of the 'Live' area of the Lenya publication (the publicly available website) should be removed from the Lucene index immediately. Analogous to publishing a document, the deactivate usecase of Lenya 1.4 should be extended to remove the document from the Lucene index.

I think this could be done more generally, such that a document is reindexed every time it changes, e.g. also after editing. There could be one index for the authoring area and one for the live area.

If the document changes, it will be reindexed. I don't really see the need for a separate index for every area.

Even more general would be to search the documents in association with the workflow, but this would probably rather be for Lenya > 1.4. I am mentioning it to point out where I think it would make sense to head.



http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
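To make the intended integration concrete, here is a minimal sketch of the two hooks. It uses a plain Map as a stand-in for the Lucene index, and the hook method names are my own invention, not actual Lenya or Lucene API; in the real implementation this logic would live in the doExecute() of the Publish and Deactivate usecases linked above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of hooking indexing into the publish/deactivate usecases.
// The Map stands in for the Lucene index; the hook method names are
// assumptions, not actual Lenya/Lucene API.
public class PublishIndexHooks {

    // Stand-in for the live Lucene index: document id -> indexed text.
    static final Map<String, String> liveIndex = new HashMap<>();

    // Would be called at the end of the Publish usecase's doExecute().
    static void onPublish(String documentId, String content) {
        liveIndex.put(documentId, content); // real code: IndexWriter.addDocument(...)
    }

    // Would be called at the end of the Deactivate usecase's doExecute().
    static void onDeactivate(String documentId) {
        liveIndex.remove(documentId); // real code: delete the document from the index
    }

    public static void main(String[] args) {
        onPublish("/news/release", "Lenya 1.4 released");
        System.out.println(liveIndex.containsKey("/news/release")); // true
        onDeactivate("/news/release");
        System.out.println(liveIndex.containsKey("/news/release")); // false
    }
}
```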
 * Document parser

  * lenya.index
Lucene comes packed with a standard XML and HTML parser to add documents to the index. This parser fetches the data out of the document and stores it in different fields of the Lucene index. The documents that Lenya works with are extended XHTML documents that can be parsed with the standard HTML parser, but this would lack the possibility of indexing the metadata that comes with these Lenya documents.

As a replacement for the ConfigurableIndexer, which creates indexes from a document based on a collection of XPath statements, I would like to propose an alternative way of configuring the indexed data. This replacement would consist of tags in the internal XML documents of Lenya. Every XML element that must be added to the index needs a special attribute, something like indexField="fieldName".

I don't think the ConfigurableIndexer should be replaced, although I am not saying its implementation is great. One wants to keep the definition centralized rather than attached to each document individually (just as is the case for the workflow or validation schema of a document). Always the same problem ;-)

IIUC then every document would have to be tagged. What if a field changes?

I am not saying your suggestion doesn't make sense for certain cases, but I wouldn't treat it as a replacement, rather as an enhancement.


I would agree with the term enhancement.

One of the big advantages of this approach would be the availability of data that isn't visible to the outside world, but could help the search mechanism determine the most relevant results. One could think of the metadata that isn't completely rendered to HTML, like the date of creation or the creator.

Besides this, it would be easier to add a new document type to Lenya when the indexing of the document can be specified in the sample document and the Relax NG schema.

Adding the indexing to the schema would probably make life easier, but it is basically the same as the current solution (one transformation from one to the other).

Every document in Lenya has an accompanying RelaxNG schema that validates every edited document when it is saved.


I don't think a document should require a schema, but I guess we get into a religious war here. In any case, you definitely cannot assume that everything is validated with RelaxNG; Lenya would limit itself badly if it neglected schema languages like XSD and others...


On the one hand you like the centralized definition of the index, as you propose to add the indexing to the schema; on the other hand, you like to keep the schema requirement as flexible as possible. I see the dilemma, and that's why I think my idea is a nice way to keep some flexibility on the schema side, but with a centralized definition in the form of the sample file.

Changing the fields would require a change to the 'obsolete' XML documents, but I think this is a rare case that should actually be avoided. Fields can be added or become obsolete without a problem, but changing a field is something that is done rarely, if ever. Could you give me a scenario where this would be an urgent problem?

This schema should allow a document to have the index attribute assigned to a number of elements. These elements should be extended with the lenya.index attribute. This must be done for all elements that are allowed to be added to the Lucene index. This may sound like a lot of work, but it shouldn't be that hard. An XHTML document, for instance, only needs a few metadata elements and the body element to be specified.

The following Relax NG snippet should be added to all elements that can be indexed. The LenyaFieldName will contain the name of the Lucene index field.

      <define name="lenya.index">
       <zeroOrMore>
        <element name="lenya:index">
         <text/>
        </element>
       </zeroOrMore>
      </define>

Notice the possibility to add more than one LenyaIndexFieldName. This makes it possible to add the same data to different fields, which can be useful when the user wants a general or a more specific search: the data is added to the more general field that is also used for other elements, and the specific field is queried when one is aware of the exact field that must be addressed.
The actual XML document must add the lenya.index elements to the elements that must be indexed. The actual field name is specified in the XML document, not in the specification. This makes filling the index more flexible, without making it harder to have a common index field for all documents. Since all documents are created from a sample XML file, the default index fields can be provided in this file. This way individual exceptions are still possible.

The LenyaIndex parser, as described above, must be applied to the most used document in Lenya: the XHTML document that is extended with Dublin Core metadata.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html


I don't fully understand your example. Can you make one which shows the mapping to the Lucene document and a content example, e.g. a press release:

<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>


Well, this is just a first shot. I will probably change it, but something like this:

<pr>
<title>
<lenya:index>title</lenya:index>
Lenya 14 release preponed
</title>
<content>
<lenya:index>contents</lenya:index>
The release of Lenya 1.4, the Apache Content Management System, ladila
</content>
</pr>
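To show how the parsing side of this mapping could work, here is a sketch that walks such a document with the JDK's DOM parser, treats every lenya:index child as a target field name, and collects the surrounding element text into those fields. The Map stands in for the Lucene Document; all class and method names here are illustrative, not actual Lenya code.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch: map <lenya:index>fieldName</lenya:index> markers to index fields.
// A Map<String, String> stands in for the Lucene Document.
public class LenyaIndexParser {

    public static Map<String, String> extractFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, String> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void collect(Element element, Map<String, String> fields) {
        List<String> fieldNames = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) child;
                if (e.getTagName().equals("lenya:index")) {
                    // each marker names one target field; several markers
                    // put the same text into several fields
                    fieldNames.add(e.getTextContent().trim());
                } else {
                    collect(e, fields); // recurse into nested content
                }
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        for (String name : fieldNames) {
            fields.merge(name, text.toString().trim(), (a, b) -> a + " " + b);
        }
    }

    public static void main(String[] args) throws Exception {
        String pr = "<pr><title><lenya:index>title</lenya:index>"
                + "Lenya 14 release preponed</title>"
                + "<content><lenya:index>contents</lenya:index>"
                + "The release of Lenya 1.4</content></pr>";
        System.out.println(extractFields(pr));
        // {title=Lenya 14 release preponed, contents=The release of Lenya 1.4}
    }
}
```

Because the marker elements are zeroOrMore, an element carrying two lenya:index children would simply end up in both fields, which covers the general-plus-specific field case mentioned above.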


  * Document boost
By adding an extra field called 'Document Boost' to the metadata of the documents, it will be possible to use the boosting feature of Lucene to control the relevance of specific documents in the search results. A pulldown menu with a choosable digit to specify the boost level should be sufficient.

http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
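A tiny sketch of the pulldown-to-boost step; the 1-9 range and the linear mapping are assumptions, and in the real code the resulting value would be passed to Document.setBoost(float) while indexing.

```java
// Sketch: map a pulldown digit to a Lucene document boost factor.
// The 1-9 range and linear mapping are assumptions; 1 is Lucene's default boost.
public class DocumentBoost {

    static float boostFor(int pulldownDigit) {
        if (pulldownDigit < 1 || pulldownDigit > 9) {
            throw new IllegalArgumentException("digit out of range: " + pulldownDigit);
        }
        // real code: luceneDocument.setBoost(value) before adding to the index
        return (float) pulldownDigit;
    }

    public static void main(String[] args) {
        System.out.println(boostFor(1)); // 1.0
        System.out.println(boostFor(5)); // 5.0
    }
}
```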

  * Extract external links

The publish process should also extract all the external links - HTML and PDF - from the document and add them to the Nutch crawler, so they can be fetched and indexed in the next Nutch run.
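The extraction step could start out as simple as the following sketch: pull href values out of the rendered XHTML and keep only absolute HTML/PDF links for the Nutch fetch list. The regex and the filtering rules are assumptions for illustration, not settled design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: collect external .html/.pdf links from a published document
// so they can be handed to the Nutch crawler. Filtering rules are assumptions.
public class ExternalLinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"]+)\"");

    static List<String> extract(String xhtml) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(xhtml);
        while (m.find()) {
            String url = m.group(1);
            // only hand HTML and PDF targets to Nutch
            if (url.endsWith(".html") || url.endsWith(".pdf")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<p><a href=\"http://example.org/report.pdf\">report</a> "
                + "<a href=\"/internal/page.html\">internal</a></p>";
        System.out.println(extract(page)); // [http://example.org/report.pdf]
    }
}
```

Relative links stay out of the list here on the assumption that internal pages are already indexed through the publish usecase itself.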

In a similar fashion, the external links should be removed from the Nutch fetch list and the Lucene index when deactivating a document.

How do you want to treat these external links?

I want to extract the links in the document parser and let Nutch fetch them when the scheduled index process runs. I am not sure yet whether I can feed them to Nutch directly or whether I should add them to a text file that Nutch uses. I will give it another look.


 * Replace custom Lucene search generator with Cocoon Search generator

There is a very clean and easy alternative to this nasty XSP page and the XSLT sheets that process its result: the Cocoon search generator. By using this generator instead of the clumsy search pipeline currently employed, it will be easier to debug or change the result set for a specific publication. Besides this, it seems good practice to me to take advantage of Cocoon's facilities as much as possible.

  http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
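For illustration, a search pipeline using the generator could look roughly like this sitemap fragment. The index path, the parameter name, and the stylesheet are assumptions and should be checked against the generator's documentation before use.

```xml
<!-- Sketch of a search pipeline using Cocoon's SearchGenerator.
     Index path, parameter names and stylesheet are assumptions. -->
<map:match pattern="search">
  <map:generate type="search"
                src="context://lenya/pubs/default/work/search/index">
    <map:parameter name="query" value="{request-param:queryString}"/>
  </map:generate>
  <map:transform src="xslt/searchresults2xhtml.xsl"/>
  <map:serialize type="xhtml"/>
</map:match>
```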

How does the XML of the search generator differ from the current Lenya implementation?

As far as I can see, it contains all the output one can ask for from a Lucene query. The nice thing is that it is possible to spread the result over different pages; the links to all pages are delivered with the output. It looks pretty comprehensive to me.

Again, thanks for the reply!

Regards, Robert
