Robert Goene wrote:
I received an email from the other participant for the lenya-search
project. He told me he would withdraw from the competition (if that is
the correct term).
;)
I am working on a new version of the proposal to integrate the feedback
Gregor gave me. There are a few blank spots left for me:
Could you please tell me what the current view is on the use of
Jackrabbit? It is not completely clear to me what the role of Jackrabbit
is in Lenya. Is it only the sitemap and the workflow data or is it
supposed to be the general storage mechanism for all the documents?
eventually, jackrabbit will probably replace most, if not all, uses of
the file system in lenya. we'd use it to store content, the sitetree,
document metadata, workflow (wf) metadata, revisions, and access
control (ac) nodes.
obviously, this will be done in stages, but it makes sense to
incorporate it into new designs.
The role of nutch and lucene is not clear to me in the former situation
and the latter should imply a different approach to searching.
jackrabbit doesn't have full text search by itself, so lucene would be
used to index the repository. nutch is used for crawling external sites,
replacing the homegrown crawling code. for instance, the university of
zurich crawled all their sites with the lenya crawler to be able to have
unified search, no matter whether a site is managed by lenya or not.
If documents are stored in jackrabbit, a local filesystem, or some other
xml-storage device and the documents are indexed when they are saved,
what is the cooperation between Lucene and Jackrabbit? I don't see it
yet. I can imagine a query capability in jackrabbit, but wouldn't this
be a replacement of the searching facility?
jackrabbit does have some query capabilities, and lenya will make use of
them. queries like "give me all documents last modified this week" are
ideally suited for jackrabbit.
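to make the distinction concrete, here is a minimal sketch of that kind of property query. it is plain python, not the jackrabbit query api; the Doc class, the modified_since function and the sample paths are all made up for illustration. the point is only that the query filters on metadata and never looks at the document text:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Doc:
    """Hypothetical document record: a path, one metadata property, a body."""
    path: str
    last_modified: datetime
    body: str


def modified_since(docs, cutoff):
    """Property query: select on metadata only, the body is never read."""
    return sorted(d.path for d in docs if d.last_modified >= cutoff)


# Arbitrary fixed "now" so the example is deterministic.
now = datetime(2000, 1, 10, 12, 0)
docs = [
    Doc("/index", now - timedelta(days=2), "welcome to the site"),
    Doc("/about", now - timedelta(days=30), "about the university"),
    Doc("/news", now - timedelta(hours=3), "site news"),
]

recent = modified_since(docs, now - timedelta(days=7))
print(recent)
```

a question like "which documents mention 'university'" is the other kind of query, and that is where the separate full text index comes in.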
lucene maintains a separate index of all content in the repository,
which could be implemented by using
http://incubator.apache.org/jackrabbit/apidocs/org/apache/jackrabbit/core/query/lucene/package-summary.html
as to how the details should work (whether the lenya api notifies
lucene about changed documents, or jackrabbit has an observer that
calls lucene), i dunno, that is up to you.
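a rough sketch of the observer variant, just to show the shape of it. this is plain python with made-up names (Repository, FullTextIndex, on_saved), not the jackrabbit or lucene api, and the tiny inverted index only stands in for a real lucene index:

```python
from collections import defaultdict


class Repository:
    """Hypothetical stand-in for the content repository."""

    def __init__(self):
        self._docs = {}
        self._listeners = []

    def add_listener(self, callback):
        # callback(path, text) is invoked after every save
        self._listeners.append(callback)

    def save(self, path, text):
        self._docs[path] = text
        for callback in self._listeners:
            callback(path, text)


class FullTextIndex:
    """Tiny inverted index standing in for a full text index."""

    def __init__(self):
        self._postings = defaultdict(set)   # term -> set of paths
        self._terms_by_path = {}            # path -> terms currently indexed

    def on_saved(self, path, text):
        # remove postings from the previous revision before re-indexing
        for term in self._terms_by_path.get(path, ()):
            self._postings[term].discard(path)
        terms = set(text.lower().split())
        self._terms_by_path[path] = terms
        for term in terms:
            self._postings[term].add(path)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), ()))


repo = Repository()
index = FullTextIndex()
repo.add_listener(index.on_saved)

repo.save("/index", "Welcome to Lenya")
repo.save("/news", "Lenya search news")
repo.save("/index", "Welcome home")   # new revision replaces the old terms

print(index.search("lenya"))    # only /news still mentions it
print(index.search("welcome"))
```

the other variant (the lenya api notifies the indexer itself) would look the same from the index side: only the place where on_saved gets called moves from the repository into the calling code.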
Is the migration to jackrabbit a current issue and if it is, could you
give me more information (a discussion thread would do) on the design
considerations made?
the repository work has been discussed on
http://wiki.apache.org/lenya/ProposalRepository
and more recently, an integration with the sitetree as a first step was
committed to the sandbox. at the same time, the lenya api internals are
slowly being rewritten to get rid of direct java.io.File calls,
replacing them with avalon sources. this will allow us to migrate
further parts of lenya to jackrabbit when the time comes. work on that
has not started yet, though.
I have no idea what to investigate, let alone what
to solve for the integration of jackrabbit and nutch.
nutch and jackrabbit would have no integration. the data from nutch
would end up in a lucene index, just as the data from jackrabbit would.
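as a sketch of what "both end up in the same index" means: two independent feeds, one searchable index. again this is plain python with invented names and sample data (build_index, the "repository" and "crawler" labels, the example.org url), not nutch or lucene code:

```python
from collections import defaultdict


def build_index(sources):
    """Merge documents from independent sources into one inverted index.

    sources maps a source label to {location: text}. The sources never
    talk to each other; they only share the index they feed.
    """
    index = defaultdict(set)
    for source_name, docs in sources.items():
        for location, text in docs.items():
            for term in set(text.lower().split()):
                index[term].add((source_name, location))
    return index


def search(index, term):
    return sorted(index.get(term.lower(), ()))


# Hypothetical sample data: one managed document, one crawled page.
sources = {
    "repository": {"/lenya/news": "faculty news"},
    "crawler": {"http://example.org/events": "faculty events"},
}

index = build_index(sources)
hits = search(index, "faculty")
print(hits)
```

a search for "faculty" returns hits from both feeds, which is the unified search case mentioned above.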
Thanks a lot. I am looking forward to working on this project!
hope this helps