Bernhard Huber wrote:
> hi, david
> thanks for the links to Z39.50.
> I read a bit about that protocol, but as I understand it,
> supporting Z39.50 might require writing an
> Avalon block implementing the Z39.50 server.
> That's a bit too much for me at the moment,
> learning Avalon in depth + Z39.50.
> Anyway, thanks!

It sure is big, Bernhard. I only meant to point out other options
so that cocoon-dev did not jump to any conclusions about which
software to use, and so that the design allowed other options to be
integrated by Cocoon end-users. These issues can be addressed, now
that Stefano has started a separate thread "[RT] semantic searching".
--David

> ----- Original Message -----
> From: David Crossley <[EMAIL PROTECTED]>
> Date: Monday, October 29, 2001 7:52 am
> Subject: Re: Subject: Lucene as Avalon Component?
>
> > Structured searching is an obvious beneficiary of a solid
> > XML framework. Cocoon would need capability to allow
> > such functionality to be implemented by any search system
> > of choice.
> >
> > I would prefer to utilise the Z39.50 protocol (ISO 23950).
> > This is stateful and session-based. It supports both fielded
> > and full-text search. It has a powerful boolean and relational
> > query syntax and various high-level abstractions.
> >
> > Importantly, there are sets of well-known attributes which
> > shield the user from how the search is implemented and from
> > how the XML records are structured. (Bernhard, this directly
> > addresses your three numbered issues below.)
> >
> > Of course, this power comes at the cost of potentially
> > complex implementation. However, this is eased by the
> > availability of solid toolkits and fully blown servers/gateways
> > (both open source and the other).
> >
> > This is the age-old search and retrieve protocol from the
> > library world, so plenty of leverage can be gained.
> > Start at: http://lcweb.loc.gov/z3950/agency/
> > Also follow their links to resources/
> > I see there at least one appropriate solution for Cocoon
> > which is open source and Java (JZKit).
> >
> > Thanks Bernhard, for raising this important topic.
> > --David Crossley
> >
> > Bernhard Huber wrote:
> > > Stefano Mazzocchi wrote:
> > > > Bernhard, perfect timing! I was thinking about the same thing
> > > > the other day.
> > > >
> > > > Bernhard Huber wrote:
> > > > >
> > > > > hi,
> > > > > I'm taking a look at Lucene, a nice search engine.
> > > > > As Cocoon2 claims to be an XML publishing engine,
> > > > > some sort of searching feature would be quite nice.
> > > >
> > > > Yes, this is very true.
> > > >
> > > > > Now I'm a bit confused how to make it usable under Cocoon2.
> > > > > Should I write a generator for the searching part of lucene?
> > > > > Should I encapsulate the indexing, and searching as
> > > > > an avalon component?
> > > >
> > > > In a perfect world (but we aim for that, right?) we should have an
> > > > abstracted search engine behavioral interface (future compatible with
> > > > semantic capabilities?) and then have an Avalon component (block?) to
> > > > implement that.
> > >
> > > and the search-engine understands your queries, semantically :-)
> > > But perhaps an advantage could be that a group of documents might
> > > present already some semantic keywords, stored in the documents,
> > > like author, and title.
> > > So searching for these keywords will give very good results.
> > >
> > > > Then, a cocoon component (a generator or a transformer, depending
> > > > on the syntax of the query language being XML or not) can use the
> > > > avalon component to power itself and generate the XML event stream.
> > >
> > > Yup, that would be nice.
> > > Moreover we can use the XML event stream not only for generating
> > > the answer of the search-query/request, but also to evaluate some
> > > hit statistics.
> > >
> > > As the XML event stream can be handled as some static xml page
> > > source.
> > >
> > > > Note that both Lucene and dbXML (probably going to be called Apache
> > > > Xindice, from the latin word "indice" -> "index") could power
> > > > this: the first as an indexer of the textual part (final pipeline
> > > > results) while the second being an indexer of the semantic part
> > > > (starting pipeline sources).
> > > >
> > > > Obviously, a semantic approach is very likely to yield much better
> > > > results, but it requires a completely different way of doing search
> > > > (look at xyzsearch.com, for example), while lucene is simply doing
> > > > textual heuristics.
> > > I will try to check xyzsearch.com
> > >
> > > But I have some trouble with "semantic".
> > >
> > > As I would say, "semantic" lies in the eye of the observer.
> > > But that's more philosophical.
> > >
> > > Perhaps it would be interesting to gather some ideas
> > > about what's the aim of using semantic search.
> > >
> > > Although the simple textual search gives a lot of bad results,
> > > it is simple to use.
> > >
> > > Using a semantic search should give better results, as the
> > > elements are taken into account when generating an index,
> > > and when evaluating the result of a query.
> > > But some points to think about:
> > > 1. What should the user already know about the semantics of the
> > > documents?
> > >
> > > 2. Does he/she have to know that a document has an author, for
> > > example?
> > >
> > > 3. Does he/she have to know that querying for author entering
> > > "author:john" will search for the author's name?
> > >
> > > Perhaps all 3 issues are just a question of designing the GUI of
> > > a semantic search...
> > >
> > > Just read now
> > > http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> > > I see, semantic means taking the xml elements into account.
> > >
> > > > This said, it's also likely that the two approaches are so different
> > > > that a single behavioral interface will be either too general or too
> > > > simple to cover both cases, so, probably, both a textual search
> > > > interface and a markup search interface will be required.
> > > >
> > > > > How should I index?
> > > >
> > > > Eh, good question :)
> > > >
> > > > My suggestion would be to connect the same xlink-based crawling
> > > > subsystem used for CLI to lucene as if it were a file system, but
> > > > this might require some Inversion of Control (us pushing files into
> > > > lucene and not lucene to crawl them or read them from disk) thus
> > > > some code changes to it.
> > > I understand your hint.
> > > I must admit that I never understood cocoon's view concept.
> > > Now I see what I can do using views.
> > > Perhaps adding an example in the view documentation, like
> > > Try using:
> > > http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> > > http://localhost:8080/cocoon/welcome?cocoon-view=links
> > > would help a lot.
> > > But perhaps I'm just a bit slow....
> > >
> > > I never intended to index the html result of a page,
> > > but the xml content (ad fontes!).
> > > Thus I was thinking about how to index an xml source.
> > >
> > > Or, saying it more generally:
> > > What would be a smart xml indexing strategy?
> > >
> > > Let's take a snippet of
> > > http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> > >
> > > ----- begin
> > > ....
> > > <s1 title="The Views">
> > > <s2 title="Introduction">
> > > <p> Views are yet another sitemap component. Unlike the rest, they
> > > are orthogonal to the resource and pipeline definitions. In the
> > > ...
> > > <s3 title="View Processing">
> > > <p>The samples sitemap contains two view definitions. One of them
> > > looks like the excerpt below.</p>
> > > <source xml:space="preserve">
> > >
> > > <map:views>
> > > <map:view name="content" from-label="content">
> > > <map:serialize type="xml"/>
> > > </map:view>
> > >
> > > </source>
> > > ....
> > > ----- end
> > >
> > > I see the following options:
> > > 1. Index only the bare text. That's simple, and stupid,
> > > as a lot of info entered by the xml generator (human, program)
> > > is ignored.
> > > 2. Try to take the element's name, and/or attributes into account.
> > > 3. Try to take the element's path into account.
> > >
> > > Let's see what queries an engine should answer:
> > > ad 1. query: "Intro", result: all docs having text Intro
> > >
> > > ad 2. query: "title:Intro", result: all docs having title elements
> > > with text Intro.
> > >
> > > ad 2. query: "source:view", result: all docs having some source code
> > > snippet regarding cocoon view concept.
> > >
> > > ad 3. query: "xpath:**/s2/title/Intro", result: all docs having s2
> > > title Intro, not sure about this, how to marry lucene with xpath
> > >
> > > >
> > > > > Let's say I want to provide one or more sub-sitemaps
> > > > > a searching feature, and let's say the index is already
> > > > > generated, how can i calculate from the internal sitemap URL
> > > > > to public browser-URL?
> > > > >
> > > > > For example I have an index over all /docs/samples/*/* files,
> > > > > how can I detect that they are all mapped to the URL
> > > > > http://machine/*/*?
> > > > > any ideas are welcome?
> > > >
> > > > The CLI subsystem works by starting at a URI, asking for the "link"
> > > > view of that URI (cocoon will then return a newline-separated list
> > > > of linked URIs created out of all those links that contain
> > > > xlink:href="" or src="" or href="" attributes), then recursively
> > > > calling itself on every linked URI.
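
As a rough sketch of the crawl described just above (not Cocoon's actual CLI code), the recursion over the "link" view could look like the Java below. The names LinkViewCrawler, linkView and parseLinkView are made up for illustration. A real fetcher would presumably request each URI with ?cocoon-view=links, as in the example URLs earlier in the thread, and hand the response body to parseLinkView; here an in-memory map stands in for the server.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Illustrative crawl driven by a "link" view; all names here are hypothetical. */
class LinkViewCrawler {
    // Hypothetical hook: given a URI, return the URIs listed by its link view.
    private final Function<String, List<String>> linkView;
    // URIs already seen, in visiting order, so that cycles terminate.
    private final Set<String> visited = new LinkedHashSet<>();

    LinkViewCrawler(Function<String, List<String>> linkView) {
        this.linkView = linkView;
    }

    /** Splits a newline-separated link-view body into non-empty URIs. */
    static List<String> parseLinkView(String body) {
        List<String> links = new ArrayList<>();
        for (String line : body.split("\\r?\\n")) {
            String uri = line.trim();
            if (!uri.isEmpty()) {
                links.add(uri);
            }
        }
        return links;
    }

    /** Visits uri, then recurses into every linked URI not seen before. */
    void crawl(String uri) {
        if (!visited.add(uri)) {
            return; // already visited
        }
        for (String linked : linkView.apply(uri)) {
            crawl(linked);
        }
    }

    Set<String> visited() {
        return visited;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" standing in for real link-view responses.
        Map<String, String> site = new HashMap<>();
        site.put("/", "/a\n/b\n");
        site.put("/a", "/b\n/\n");
        LinkViewCrawler crawler = new LinkViewCrawler(
                uri -> parseLinkView(site.getOrDefault(uri, "")));
        crawler.crawl("/");
        System.out.println(crawler.visited());
    }
}
```

Keeping the fetch behind a function means the same recursion can be exercised without a running Cocoon instance. Plain recursion is used only for brevity; a very deep site would probably want an explicit work queue instead.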
> > > >
> > > > When it reaches a leaf (a page with no further links or links that
> > > > were already visited), it asks for the "link-translated" view of
> > > > the URI, passing the newline-separated list of links in the POST
> > > > request so that Cocoon knows how to regenerate an adapted version
> > > > of the resource (this is useful to maintain link consistency when
> > > > moved on a file system and working on the original link semantics,
> > > > it works for every file format, even for PDF, because link
> > > > translation happens transparently before serialization takes place).
> > > >
> > > > Last operation is URI mangling where, depending on the given MIME
> > > > type of the returned resource, the proper extension is added to the
> > > > file name and the resource is saved on disk.
> > > >
> > > > Another important feature is that the "link" view also indicates as
> > > > "dynamic" those links that have a particular xlink role (behavior)
> > > > xlink:role="dynamic", so they are skipped by the CLI generation
> > > > and a placeholder is written (that might redirect to the original
> > > > URI, for example).
> > > >
> > > > So, currently, indexers like lucene assume that what goes out of a
> > > > web server is what is already in (at least, for static pages).
> > > > Cocoon doesn't work that way.
> > > >
> > > > So, the indexer should crawl from the end side (the web side, just
> > > > like big search engines do) and not assume anything about how the
> > > > files are generated internally.
> > > >
> > > > The only difference is that Cocoon implements a standard behavior of
> > > > resource views and we can use those to gain more information about
> > > > the requests without missing the semantic information that cocoon
> > > > already stores (such as the xlink information).
> > > >
> > > > So, IMO, the most elegant and effective solution would be to connect
> > > > lucene to the cocoon view-based crawling subsystem:
> > > >
> > > > 1) start with some URI (the root, mostly)
> > > > 2) obtain the link view of the resource
> > > > 3) recursively call itself on non-dynamic links until a leaf is
> > > > reached
> > > > 4) obtain the leaf resource (performing translation to adapt the
> > > > cocoon-relative URIs to the site-relative URIs) and push it into
> > > > lucene
> > > > 5) continue until all leaves are processed.
> > >
> > > I will try to implement something like that...
> > >
> > > Design-Draft
> > >
> > > 1. Crawling:
> > > Using the above described cocoon view-based crawling subsystem
> > >
> > > 2. Indexing:
> > > 2.1 Each element-name will create a lucene field having the
> > > same name as the element-name.
> > > (?What about the element's name space, should I take it into account?)
> > >
> > > 2.2 Each attribute of an element will create a lucene field having
> > > the concatenated name of the element-name, and the attribute-name.
> > > 2.3 Having a field named body for the bare text.
> > >
> > > 3. Searching
> > > Just use the lucene search engine.
> > >
> > > (btw,
> > > I was already playing with lucene for indexing/searching mail messages
> > > stored in mbox. This way I was searching the
> > > http://xml.apache.org/mails/200109.gz,
> > >
> > > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages.
> > > But that's a semantic problem, as the mail messages have poor
> > > xml-semantic content :-)
> > > )
> > >
> > > > Note that "dynamic" has a different sense than before and it means
> > > > that the resource result is not dependent on request-based or
> > > > environmental parameters (such as user-agent, date, time, machine
> > > > load, IP address, whatever).
> > > > A resource that is produced by aggregating a ton of documents
> > > > stored in a database must be considered static if it is not
> > > > dependent on request parameters.
> > > >
> > > > For a semantic crawler, instead of asking for the "standard" view, it
> > > > would ask for semantic-specific views such as "content" (the most
> > > > semantic stage at pipeline generation, which we already specify in
> > > > our example sitemaps) or "schema" (not currently implemented as
> > > > nobody would use it today anyway).
> > > >
> > > > But the need of resource "views" is the key to the success of proper
> > > > search capabilities and we must be sure that we use them even for
> > > > semantically-poor searching solutions like lucene, but that would
> > > > kick ass anyway on small to medium size web sites.
> > > >
> > > > Hope this helps and if you have further questions, don't mind asking.
> > >
> > > thanks for your suggestions, helping a lot to understand cocoon better.
> > >
> > > bye berni
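
To make the indexing rules 2.1 to 2.3 of Bernhard's Design-Draft more concrete, here is a minimal Java sketch that turns an XML source into a map of field names to values using the JDK's SAX parser. It deliberately stops short of Lucene itself; feeding the resulting pairs into a Lucene document is left out. Several choices are assumptions, not something the thread specifies: the class name XmlFieldExtractor, the "." used to join element and attribute names, indexing only an element's direct text under its own name, and ignoring namespaces (the open question in 2.1).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative field extraction for the Design-Draft; names are hypothetical. */
class XmlFieldExtractor extends DefaultHandler {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    // One text buffer per currently open element (direct text only).
    private final Deque<StringBuilder> open = new ArrayDeque<>();
    // All character data of the document, for the "body" field.
    private final StringBuilder body = new StringBuilder();

    private void add(String field, String value) {
        String v = value.trim();
        if (!v.isEmpty()) {
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(v);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            // Rule 2.2: element name joined with attribute name ("." is an assumption).
            add(qName + "." + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().append(ch, start, length);
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        // Rule 2.1: a field named after the element, holding its direct text.
        add(qName, open.pop().toString());
    }

    /** Parses xml and returns field name to values, including a "body" field. */
    static Map<String, List<String>> extract(String xml) throws Exception {
        XmlFieldExtractor handler = new XmlFieldExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        // Rule 2.3: the bare text of the whole document, whitespace collapsed.
        handler.add("body", handler.body.toString().replaceAll("\\s+", " "));
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<s1 title=\"The Views\"><s2 title=\"Introduction\">"
                + "<p>Views are yet another sitemap component.</p></s2></s1>";
        System.out.println(extract(xml));
    }
}
```

With a mapping along these lines, a query such as "title:Intro" from the thread would have to be phrased against the joined field name (for example s2.title) unless attribute values were also indexed under the bare attribute name; which of the two is preferable is a design question the draft leaves open.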