Hi,
Thanks for actually reading it and giving a thorough reply!
* Integrate the indexing process with the Lenya publishing usecases
* Index the document when published
When a document is published, it should be added to the Lucene index
immediately. This can be accomplished by extending the publish process,
which is implemented as a Lenya 1.4 usecase.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html
* Remove the document from the index when deactivated
Documents that are no longer a part of the 'Live' section of the
Lenya publication
(the public available website) should be immediately removed from
the Lucene index.
In a similar fashion as the publishing of a document, the
deactivate usecase of
Lenya 1.4 should be extended with a removal of the document of the
Lucene index.
I think this could be done more generally, such that a document is
indexed every time it changes, e.g. also after editing; there could be
one index for the authoring area and one for the live area.
If the document changes, it will be reindexed. I don't really see the
need for a separate index for every area.
Even more general would be to search the documents in association with
the workflow, but this would probably rather be material for Lenya > 1.4.
I am mentioning it to point out where I think it would make sense to head.
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
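To make the two hooks above concrete, here is a minimal sketch. None of this is Lenya code: the class and method names (LiveIndex, onPublish, onDeactivate) are made up, and a HashMap stands in for the real Lucene index, which would use IndexWriter.addDocument() on publish and delete by a Term on a URL field on deactivate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a HashMap stands in for the Lucene index so the
// publish/deactivate lifecycle can be shown without the Lucene jars.
public class LiveIndex {

    // document URL -> extracted full text
    private final Map<String, String> index = new HashMap<String, String>();

    // Would be called at the end of the extended Publish usecase.
    public void onPublish(String documentUrl, String extractedText) {
        // A real implementation would call IndexWriter.addDocument(...)
        index.put(documentUrl, extractedText);
    }

    // Would be called at the end of the extended Deactivate usecase.
    public void onDeactivate(String documentUrl) {
        // A real implementation would delete by a Term on the URL field.
        index.remove(documentUrl);
    }

    public boolean isIndexed(String documentUrl) {
        return index.containsKey(documentUrl);
    }
}
```

The point of the sketch is only that both usecases end with one symmetric call into the same index component.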
* Document parser: lenya.index
Lucene comes packed with a standard XML and HTML parser to add
documents to the index. This parser fetches the data out of the
document and stores it in different fields of the Lucene index. The
documents that Lenya works with are extended XHTML documents that can
be parsed with the standard HTML parser, but that would lose the
possibility of indexing the metadata that comes with these Lenya
documents.
As a replacement for the ConfigurableIndexer, which creates indexes
from a document based on a collection of XPath statements, I would like
to propose an alternative way of configuring the indexed data. This
replacement would consist of tags in the internal XML documents of
Lenya. Every XML element that must be added to the index needs a
special attribute, something like indexField="fieldName".
I don't think the ConfigurableIndexer should be replaced, although I am
not saying its implementation is great. One wants to keep the
definition centralized and attached individually (just as is the case
for the workflow or validation schema of a document). Always the same
problem ;-)
IIUC, then every document would have to be tagged. What if a field
changes?!
I am not saying your suggestion doesn't make sense for certain cases,
but I wouldn't treat it as a replacement, rather as an enhancement.
I would agree with the term enhancement.
One of the big advantages of this approach would be the availability of
data that isn't visible to the outside world, but could help the search
mechanism determine the most relevant results. One could think of the
metadata that isn't completely rendered to HTML, like the date of
creation or the creator.
Besides this, it would be easier to add a new document type to Lenya
when the indexing of the document can be specified in the sample
document and the Relax NG schema.
Adding the indexing to the schema would probably make life easier, but
it is basically the same as the current solution (one transformation
from one to the other).
Every document in Lenya has an accompanying RelaxNG
schema that validates every edited document when
it is saved.
I don't think a document should require a schema, but I guess we get
into a religious war here. Still, you can definitely not assume that
everything is validated by RelaxNG, because Lenya would limit itself
badly if it neglected schemas like XSD and others ...
On the one hand you like the centralized definition of the index, as
you propose to add the indexing to the schema; on the other hand, you
want to keep the schema requirement as flexible as possible. I see the
dilemma, and that's why I think my idea is a nice way to keep some
flexibility on the schema side, but with a centralized definition in
the form of the sample file.
Changing the fields would require a change to the 'obsolete' XML
documents, but I think this is a rare case that should actually be
avoided. Fields can be added or become obsolete without a problem, but
renaming a field is something that is done rarely, if ever. Could you
give me a scenario where this would be an urgent problem?
This schema should allow a document to have the index marker assigned
to a number of elements. These elements should be extended with the
lenya.index pattern, and this must be done for all elements that are
allowed to be added to the Lucene index. This may sound like a lot of
work, but it shouldn't be that hard. An XHTML document, for instance,
only needs several metadata elements and the body element to be
specified.
The following Relax NG snippet should be added to all elements that can
be indexed. The text content of the lenya:index element will contain
the name of the Lucene index field.
<define name="lenya.index">
  <zeroOrMore>
    <element name="lenya:index">
      <text/>
    </element>
  </zeroOrMore>
</define>
Notice the possibility to add more than one lenya:index element. This
makes it possible to add the same data to different fields, which can
be useful when the user wants a general or a more specific search: the
data is added to a general field that is also fed by other elements,
and the specific field is queried when one knows the exact field that
must be addressed.
The actual XML document must add the lenya:index elements to the
elements that must be indexed. The actual field name is specified in
the XML document, not in the specification. This makes filling the
index more flexible, without making it harder to have a common index
field for all documents. Since all documents are created from a sample
XML file, the default index fields can be provided in this file, while
individual exceptions are still possible.
The lenya.index parser, as described above, must be applied to the most
used document type in Lenya: the XHTML document that is extended with
Dublin Core metadata.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html
I don't fully understand your example. Can you make one which shows the
mapping to the Lucene document and a content example, e.g. a press release:
<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>
Well, this is just a first shot. I will probably change it, but
something like this:
<pr>
  <title>
    <lenya:index>title</lenya:index>
    Lenya 1.4 release preponed
  </title>
  <content>
    <lenya:index>contents</lenya:index>
    The release of Lenya 1.4, the Apache Content Management System, has
    been preponed.
  </content>
</pr>
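A rough sketch of the parsing side of such a mapping, using only the JDK's DOM classes (the class and helper names, LenyaIndexExtractor and extractFields, are made up for illustration): every element that carries one or more <lenya:index> marker children contributes its remaining text to each named Lucene field.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the proposed lenya.index parser: it collects
// fieldName -> text pairs that would then become Lucene Fields.
public class LenyaIndexExtractor {

    public static Map<String, String> extractFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Map<String, String> fields = new HashMap<String, String>();
        walk(doc.getDocumentElement(), fields);
        return fields;
    }

    private static void walk(Element element, Map<String, String> fields) {
        List<String> fieldNames = new ArrayList<String>();
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) child;
                if ("lenya:index".equals(e.getNodeName())) {
                    // Marker element: its text names a Lucene index field.
                    fieldNames.add(e.getTextContent().trim());
                } else {
                    walk(e, fields);
                }
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        // More than one marker feeds the same content into several
        // fields (the general + specific case described above).
        for (String name : fieldNames) {
            fields.put(name, text.toString().trim());
        }
    }
}
```

Run against the press-release example above, this would yield a "title" field and a "contents" field ready to be put into a Lucene Document.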
* Document boost
By adding an extra field called 'Document Boost' to the metadata of
the documents, it will be possible to use the boosting feature of
Lucene to control the relevance of specific documents in the search
results. A pulldown menu with a choosable digit to specify the boost
level should be sufficient.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
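Conceptually, Document.setBoost(float) (linked above) scales a document's score at search time. The effect can be illustrated with plain arithmetic; the class name and numbers here are invented for illustration:

```java
// Toy illustration of document boosting: a raw query score scaled by a
// per-document boost factor, mirroring what Lucene's
// Document.setBoost(float) does conceptually.
public class BoostDemo {

    public static float boostedScore(float rawScore, float boost) {
        return rawScore * boost;
    }
}
```

With two documents of equal raw score, an editor-chosen boost of 2.0 from the proposed pulldown would rank that document above one left at the default 1.0.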
* Extract external links
The publish process should also extract all the external links (HTML
and PDF) from the document and add them to the Nutch crawler, so they
can be fetched and indexed in the next Nutch run. In a similar fashion,
the external links should be removed from the Nutch fetch list and the
Lucene index when a document is deactivated.
How do you want to treat these external links?
I want to collect the links in the document parser and let Nutch fetch
them when the scheduled index process runs. I am not sure yet if I can
feed them to Nutch directly or if I should add them to a text file that
Nutch uses. I will give it another look.
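Collecting the links could be a small extra step in the document parser. A rough sketch (the class name is made up, the regex is a simplification, and treating "starts with http" as the externality test ignores links to the publication's own host, which a real implementation would have to filter out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull the external HTML/PDF links out of a
// document so they can be handed to Nutch for the next crawl.
public class ExternalLinkExtractor {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static List<String> extract(String xhtml) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(xhtml);
        while (m.find()) {
            String url = m.group(1);
            // Simplification: absolute http(s) links count as external.
            boolean external = url.startsWith("http://")
                    || url.startsWith("https://");
            // Only the fetchable types mentioned in the proposal.
            boolean fetchable = url.endsWith(".html") || url.endsWith(".pdf");
            if (external && fetchable) {
                links.add(url);
            }
        }
        return links;
    }
}
```

The same list, computed at deactivation time, would identify the entries to drop from the Nutch fetch list.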
* Replace custom Lucene search generator with Cocoon Search generator
There is a very clean and easy alternative to this nasty XSP page and
the XSLT sheets that process its result: the Cocoon search generator.
By using this generator instead of the clumsy search pipeline currently
employed, it will be easier to debug or change the result set for a
specific publication. Besides this, it seems good practice to take
advantage of Cocoon's facilities as much as possible.
http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
How does the XML of the search generator differ from the current Lenya
implementation?
As far as I can see, it contains all the output one can ask for from a
Lucene query. The nice thing is: it is possible to spread the result
over different pages, and the links to all pages are delivered with the
output. It looks pretty comprehensive to me.
Again, thanks for the reply!
Regards, Robert