Bernhard, perfect timing! I was thinking about the same thing the other
day.

Bernhard Huber wrote:
> 
> hi,
> I'm taking a look at Lucene, a nice search engine.
> As Cocoon2 claims to be an XML publishing engine,
> some sort of searching feature would be quite nice.

Yes, this is very true.
 
> Now I'm a bit confused about how to make it usable under Cocoon2.
> Should I write a generator for the searching part of lucene?
> Should I encapsulate the indexing, and searching as
> an avalon component?

In a perfect world (but we aim for that, right?) we should have an
abstracted search-engine behavioral interface (forward-compatible with
semantic capabilities?) and then have an Avalon component (block?) that
implements it.

Then, a Cocoon component (a generator or a transformer, depending on
whether the query language syntax is XML or not) can use the Avalon
component to power itself and generate the XML event stream.
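For illustration, here is a minimal sketch of what such a behavioral
interface might look like. All the names (SearchEngine,
NaiveSearchEngine) are made up for this example; they are not existing
Avalon or Cocoon classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical behavioral interface: a Lucene-backed Avalon block would
// implement this, and a Cocoon generator/transformer would consume it.
interface SearchEngine {
    void index(String uri, String content);   // push a resource into the index
    List<String> search(String query);        // return URIs of matching resources
}

// Trivial in-memory implementation, standing in for a real Lucene block.
class NaiveSearchEngine implements SearchEngine {
    private final List<String[]> docs = new ArrayList<>();

    public void index(String uri, String content) {
        docs.add(new String[]{uri, content});
    }

    public List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (String[] d : docs)
            if (d[1].toLowerCase().contains(query.toLowerCase()))
                hits.add(d[0]);
        return hits;
    }
}
```

A search generator would then just call search() and emit one SAX
fragment per hit, without knowing which engine sits behind the interface.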

Note that both Lucene and dbXML (probably going to be called Apache
Xindice, from the Latin word "indice" -> "index") could power this: the
first as an indexer of the textual part (final pipeline results), the
second as an indexer of the semantic part (starting pipeline sources).

Obviously, a semantic approach is very likely to yield much better
results, but it requires a completely different way of doing search
(look at xyzsearch.com, for example), while Lucene simply does
textual heuristics.

This said, it's also likely that the two approaches are so different
that a single behavioral interface will be either too general or too
simple to cover both cases, so, probably, both a textual search
interface and a markup search interface will be required.
 
> How should I index?

Eh, good question :)

My suggestion would be to connect the same xlink-based crawling
subsystem used for the CLI to Lucene as if it were a file system, but
this might require some Inversion of Control (us pushing files into
Lucene rather than Lucene crawling them or reading them from disk), thus
some code changes to it.
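To make that inversion concrete, here is a rough sketch. All names
(DocumentSink, InMemorySink, PushCrawler) are hypothetical; the point is
only the shape of the control flow, with the crawler in charge and the
indexer passive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// With IoC, the indexer is passive: it receives documents instead of
// reading them from disk itself.
interface DocumentSink {
    void add(String uri, String content);
}

// Stand-in for a Lucene-backed sink; a real one would build a Lucene
// Document and hand it to an index writer.
class InMemorySink implements DocumentSink {
    final Map<String, String> docs = new LinkedHashMap<>();
    public void add(String uri, String content) { docs.put(uri, content); }
}

// The crawling subsystem is in control: it walks the rendered pages and
// pushes each one into the sink.
class PushCrawler {
    static void feed(Map<String, String> renderedPages, DocumentSink sink) {
        for (Map.Entry<String, String> page : renderedPages.entrySet())
            sink.add(page.getKey(), page.getValue());
    }
}
```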

> Let's say I want to provide one or more sub-sitemaps
> with a searching feature, and let's say the index is already
> generated: how can I map from the internal sitemap URL
> to the public browser URL?
> 
> For example I have an index over all /docs/samples/*/* files,
> how can I detect that they are all mapped to the URL http://machine/*/*?
> 
> Any ideas are welcome!

The CLI subsystem works by starting at a URI, asking for the "link" view
of that URI (Cocoon will then return a newline-separated list of linked
URIs, created out of all those links that carry xlink:href="" or src=""
or href="" attributes), then recursively calling itself on every linked
URI.
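As a sketch of the client side of this protocol (assuming the link view
body really is just a newline-separated list, as described above; the
class name LinkViewClient is made up), it comes down to simple string
handling:

```java
import java.util.ArrayList;
import java.util.List;

class LinkViewClient {
    // Parse the newline-separated body of a "link" view into a URI list,
    // skipping blank lines.
    static List<String> parseLinkView(String body) {
        List<String> uris = new ArrayList<>();
        for (String line : body.split("\n")) {
            String uri = line.trim();
            if (!uri.isEmpty()) uris.add(uri);
        }
        return uris;
    }

    // Build the newline-separated body later POSTed to the
    // "link-translated" view.
    static String buildPostBody(List<String> links) {
        return String.join("\n", links);
    }
}
```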

When it reaches a leaf (a page with no further links, or only links that
were already visited), it asks for the "link-translated" view of the
URI, passing the newline-separated list of links to the request via
POST, so that Cocoon knows how to regenerate an adapted version of the
resource. This is useful to maintain link consistency when the result is
moved onto a file system while preserving the original link semantics;
it works for every file format, even PDF, because link translation
happens transparently before serialization takes place.

The last operation is URI mangling: depending on the MIME type of the
returned resource, the proper extension is added to the file name and
the resource is saved to disk.

Another important feature is that the "link" view also marks as
"dynamic" those links that carry a particular xlink role (behavior),
xlink:role="dynamic"; these are skipped by the CLI generation and a
placeholder is written instead (one that might redirect to the original
URI, for example).

So, currently, indexers like Lucene assume that what comes out of a web
server is what is already in it (at least for static pages). Cocoon
doesn't work that way.

So the indexer should crawl from the end side (the web side, just like
the big search engines do) and not assume anything about how the files
are generated internally.

The only difference is that Cocoon implements a standard behavior of
resource views, and we can use those to gain more information about the
requests without missing the semantic information that Cocoon already
stores (such as the xlink information).

So, IMO, the most elegant and effective solution would be to connect
Lucene to the Cocoon view-based crawling subsystem:

 1) start with some URI (usually the root)
 2) obtain the link view of the resource
 3) recursively call itself on non-dynamic links until a leaf is reached
 4) obtain the leaf resource (performing translation to adapt the
Cocoon-relative URIs to the site-relative URIs) and push it into Lucene
 5) continue until all leaves are processed.
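The steps above can be sketched roughly like this. The Map stands in for
Cocoon answering "link view" requests, a leading "!" marks a link
flagged xlink:role="dynamic", and the class name ViewCrawler is made up;
the leaf handling is where the push into Lucene would go.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ViewCrawler {
    // Step 1: start at the root URI and return every static URI reached.
    static Set<String> crawl(String root, Map<String, List<String>> linkView) {
        Set<String> visited = new LinkedHashSet<>();
        walk(root, linkView, visited);
        return visited;
    }

    private static void walk(String uri, Map<String, List<String>> linkView,
                             Set<String> visited) {
        if (!visited.add(uri)) return;              // already processed
        // Step 2: obtain the link view of the resource.
        for (String link : linkView.getOrDefault(uri, List.of())) {
            if (link.startsWith("!")) continue;     // step 3: skip dynamic links
            walk(link, linkView, visited);          // step 3: recurse
        }
        // Steps 4-5: at a leaf, the real crawler would fetch the
        // link-translated resource and push it into Lucene.
    }
}
```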

Note that "dynamic" has a different sense than before: here it means
that the resource result depends on request-based or environmental
parameters (such as user-agent, date, time, machine load, IP address,
whatever). A resource that is built by aggregating a ton of documents
stored in a database must be considered static if it does not depend on
request parameters.

A semantic crawler, instead of asking for the "standard" view, would
ask for semantic-specific views such as "content" (the most semantic
stage of pipeline generation, which we already specify in our example
sitemaps) or "schema" (not currently implemented, as nobody would use it
today anyway).

But resource "views" are the key to the success of proper search
capabilities, and we must be sure to use them even for
semantically-poor search solutions like Lucene, which would kick ass
anyway on small to medium-sized web sites.

Hope this helps, and if you have further questions, don't hesitate to ask.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------


