Hi David,
thanks for the links to Z39.50. I read a bit about the protocol, but as I understand it, supporting Z39.50 might require writing an Avalon block that implements a Z39.50 server. Learning Avalon in depth plus Z39.50 is a bit too much for me at the moment. Anyway, thanks!
----- Original Message -----
From: David Crossley <[EMAIL PROTECTED]>
Date: Monday, October 29, 2001 7:52 am
Subject: Re: Subject: Lucene as Avalon Component?

> Structured searching is an obvious beneficiary of a solid
> XML framework. Cocoon would need capability to allow
> such functionality to be implemented by any search system
> of choice.
>
> I would prefer to utilise the Z39.50 protocol (ISO 23950).
> This is stateful and session-based. It supports both fielded
> and full-text search. It has a powerful boolean and relational
> query syntax and various high-level abstractions.
>
> Importantly, there are sets of well-known attributes which
> shield the user from how the search is implemented and from
> how the XML records are structured. (Bernhard, this directly
> addresses your three numbered issues below.)
>
> Of course, this power comes at the cost of potentially
> complex implementation. However, this is eased by the
> availability of solid toolkits and fully blown servers/gateways
> (both open source and the other).
>
> This is the age-old search and retrieve protocol from the
> library world, so plenty of leverage can be gained.
> Start at: http://lcweb.loc.gov/z395
> Also follow their links to resources/
> I see there at least one appropriate solution for Cocoon which is
> open source and Java (JZKit).
>
> Thanks Bernhard, for raising this important topic.
> --David Crossley
>
> Bernhard Huber wrote:
> > Stefano Mazzocchi wrote:
> > > Bernhard, perfect timing! I was thinking about the same thing the
> > > other day.
> > >
> > > Bernhard Huber wrote:
> > > >
> > > > hi,
> > > > I'm taking a look at Lucene, a nice search engine.
> > > > As Cocoon2 claims to be an XML publishing engine,
> > > > some sort of searching feature would be quite nice.
> > >
> > > Yes, this is very true.
> > >
> > > > Now I'm a bit confused how to make it usable under Cocoon2.
> > > > Should I write a generator for the searching part of lucene?
> > > > Should I encapsulate the indexing, and searching as
> > > > an avalon component?
> > >
> > > In a perfect world (but we aim for that, right?) we should have an
> > > abstracted search engine behavioral interface (future compatible with
> > > semantic capabilities?) and then have an Avalon component (block?) to
> > > implement that.
> >
> > and the search-engine understands your queries, semantically :-)
> > But perhaps an advantage could be that a group of documents might
> > present already some semantic keywords, stored in the documents,
> > like author, and title.
> > So searching for these keywords will give very good results.
> >
> > > Then, a cocoon component (a generator or a transformer, depending
> > > on the syntax of the query language being XML or not) can use the
> > > avalon component to power itself and generate the XML event stream.
> >
> > Yup, that would be nice.
> > Moreover we can use the XML event stream not only for generating
> > the answer of the search-query/request, but evaluate some hit
> > statistics.
> >
> > As the XML event stream can be handled as some static xml page source.
> >
> > > Note that both Lucene and dbXML (probably going to be called Apache
> > > Xindice, from the latin word "indice" -> "index") could power
> > > this: the first as an indexer of the textual part (final pipeline
> > > results) while the second being an indexer of the semantic part
> > > (starting pipeline sources).
> > >
> > > Obviously, a semantic approach is very likely to yield much better
> > > results, but it requires a completely different way of doing search
> > > (look at xyzsearch.com, for example), while lucene is simply doing
> > > textual heuristics.
> > I will try to check xyzsearch.com
> >
> > But I have some troubles with "semantic".
> >
> > As I would say "semantic" lies in the eye of the observer.
> > But that's more philosophical.
> >
> > Perhaps it would be interesting to gather some ideas,
> > about what's the aim of using semantic search.
> >
> > Although the simple textual search gives a lot of bad results,
> > it is simple to use.
> >
> > Using a semantic search should give better results, as the
> > elements are taken into account when generating an index,
> > and when evaluating the result of a query.
> > But some points to think about:
> > 1. What does the user need to know already about the semantics of the
> > documents?
> >
> > 2. Does he/she have to know that a document has an author, for example?
> >
> > 3. Does he/she have to know that querying for author entering
> > "author:john" will search for the author's name?
> >
> > Perhaps all 3 issues are just a question of designing the GUI of
> > a semantic search...
> >
> > Just read now
> > http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> > I see, semantic means taking the xml elements into account.
> >
> > > This said, it's also likely that the two approaches are so different
> > > that a single behavioral interface will be either too general or too
> > > simple to cover both cases, so, probably, both a textual search
> > > interface and a markup search interface will be required.
> > >
> > > > How should I index?
> > >
> > > Eh, good question :)
> > >
> > > My suggestion would be to connect the same xlink-based crawling
> > > subsystem used for CLI to lucene as if it was a file system, but this
> > > might require some Inversion of Control (us pushing files into
> > > lucene and not lucene to crawl them or read them from disk) thus
> > > some code changes to it.
> > I understand your hint.
> > I must admit that I never understood cocoon's view concept.
> > Now I see what I can do using views.
> > Perhaps adding an example in the view documentation, like
> > Try using:
> > http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> > http://localhost:8080/cocoon/welcome?cocoon-view=links
> > would help a lot.
> > But perhaps I'm just a bit slow....
> >
> > I never intended to index the html result of a page,
> > but the xml content (ad fontes!).
> > Thus I was thinking about how to index an xml source.
> >
> > Or, to put it more generally:
> > What would be a smart xml indexing strategy?
> >
> > Let's take a snippet of
> > http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> >
> > ----- begin
> > ....
> > <s1 title="The Views">
> > <s2 title="Introduction">
> > <p> Views are yet another sitemap component. Unlike the rest, they
> > are orthogonal to the resource and pipeline definitions. In the
> > ...
> > <s3 title="View Processing">
> > <p>The samples sitemap contains two view definitions. One of them
> > looks like the excerpt below.</p>
> > <source xml:space="preserve">
> >
> > <map:views>
> > <map:view name="content" from-label="content">
> > <map:serialize type="xml"/>
> > </map:view>
> >
> > </source>
> > ....
> > ----- end
> >
> > I see following options:
> > 1. Index only the bare text. That's simple, and stupid,
> > as a lot of info entered by the xml generator (human, program)
> > is ignored.
> > 2. Try to take the element's name, and/or attributes into account.
> > 3. Try to take the element's path into account.
> >
> > Let's see what queries an engine should answer:
> > ad 1. query: "Intro", result: all docs having text cocoon
> >
> > ad 2. query: "title:Intro", result: all docs having title elements with
> > text Intro.
> >
> > ad 2. query: "source:view", result: all docs having some source code
> > snippet regarding cocoon view concept.
> >
> > ad 3. query: "xpath:**/s2/title/Intro", result: all docs having s2 title
> > Intro, not sure about this how to marry lucene with xpath
> >
> > > > Let's say I want to provide one or more sub-sitemaps
> > > > a searching feature, and let's say the index is already
> > > > generated, how can I calculate from the internal sitemap URL
> > > > to public browser-URL?
> > > >
> > > > For example I have an index over all /docs/samples/*/* files,
> > > > how can I detect that they are all mapped to the URL
> > > > http://machine/*/*?
> > > > any ideas are welcome.
> > >
> > > The CLI subsystem works by starting at a URI, asking for the
> > > "link" view of that URI (cocoon will then return a newline-separated
> > > list of linked URIs created out of all those links that contain
> > > xlink:href="" or src="" or href="" attributes), then recursively
> > > calling itself on every linked URI.
> > >
> > > When it reaches a leaf (a page with no further links or links that
> > > were already visited), it asks for the "link-translated" view of
> > > the URI, passing in POST to the request the newline-separated list of
> > > links so that Cocoon knows how to regenerate an adapted version of the
> > > resource (this is useful to maintain link consistency when moved on a
> > > file system and working on the original link semantics, it works for
> > > every file format, even for PDF, because link translation happens
> > > transparently before serialization takes place).
> > >
> > > The last operation is URI mangling where, depending on the given
> > > MIME-type of the returned resource, the proper extension is added to
> > > the file name and the resource is saved on disk.
> > >
> > > Another important feature is that the "link" view also indicates as
> > > "dynamic" those links that have a particular xlink role (behavior)
> > > xlink:role="dynamic", so they are skipped by the CLI generation
> > > and a placeholder is written (that might redirect to the original
> > > URI, for example).
> > >
> > > So, currently, indexers like lucene assume that what goes out of a web
> > > server is what is already in (at least, for static pages). Cocoon
> > > doesn't work that way.
> > >
> > > So, the indexer should crawl from the end side (the web side, just
> > > like big search engines do) and not assume anything about how the
> > > files are generated internally.
> > >
> > > The only difference is that Cocoon implements a standard behavior of
> > > resource views and we can use those to gain more information about the
> > > requests without missing the semantic information that cocoon already
> > > stores (such as the xlink information).
> > >
> > > So, IMO, the most elegant and effective solution would be to connect
> > > lucene to the cocoon view-based crawling subsystem:
> > >
> > > 1) start with some URI (the root, mostly)
> > > 2) obtain the link view of the resource
> > > 3) recursively call itself on non-dynamic links until a leaf is
> > > reached
> > > 4) obtain the leaf resource (performing translation to adapt the
> > > cocoon-relative URIs to the site-relative URIs) and push it into
> > > lucene
> > > 5) continue until all leaves are processed.
> >
> > I will try to implement something like that...
> >
> > Design-Draft
> >
> > 1. Crawling:
> > Using the above described cocoon view-based crawling subsystem
> >
> > 2. Indexing:
> > 2.1 Each element-name will create a lucene field having the
> > same name as the element-name.
> > (?What about the element's name space, should I take it into account?)
> >
> > 2.2 Each attribute of an element will create a lucene field having
> > the concatenated name of the element-name, and the attribute-name.
> > 2.3 Having a field named body for the bare text.
> >
> > 3. Searching
> > Just use the lucene search engine.
> >
> > (btw,
> > I was already playing with lucene for indexing/searching mail messages
> > stored in mbox. This way I was searching the
> > http://xml.apache.org/mails/200109.gz,
> >
> > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages.
> > But that's a semantic problem, as the mail messages have poor
> > xml-semantic content :-)
> > )
> >
> > > Note that "dynamic" has a different sense than before and it means
> > > that the resource result is not dependent on request-based or
> > > environmental parameters (such as user-agent, date, time, machine
> > > load, IP address, whatever). A resource that is done aggregating a
> > > ton of documents stored on a database must be considered static if
> > > it is not dependent on request parameters.
> > >
> > > For a semantic crawler, instead of asking for the "standard" view, it
> > > would ask for semantic-specific views such as "content" (the most
> > > semantic stage at pipeline generation, which we already specify in our
> > > example sitemaps) or "schema" (not currently implemented as nobody
> > > would use it today anyway).
> > >
> > > But the need of resource "views" is the key to the success of proper
> > > search capabilities and we must be sure that we use them even for
> > > semantically-poor searching solutions like lucene, but that would kick
> > > ass anyway on small to medium size web sites.
> > >
> > > Hope this helps and if you have further questions, don't mind asking.
> >
> > thanks for your suggestions, helping a lot to understand cocoon better.
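To make the five crawl steps quoted above a bit more concrete, here is a rough sketch in Python. It is only an illustration, not Cocoon or Lucene code: the `link_view` callback, the `(uri, is_dynamic)` tuple shape and the in-memory `SITE` table are all hypothetical stand-ins for a real request to the "links" view. One more assumption: the sketch pushes every visited resource to the indexer once its links are handled, not only the leaves.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for the "links" view: URI -> [(linked URI, is_dynamic)].
LinkView = Callable[[str], List[Tuple[str, bool]]]


def crawl(start_uri: str, link_view: LinkView, push: Callable[[str], None]) -> None:
    """Depth-first crawl that visits each URI once and skips dynamic links."""
    visited: Set[str] = set()

    def visit(uri: str) -> None:
        if uri in visited:
            return
        visited.add(uri)
        for linked_uri, is_dynamic in link_view(uri):
            if not is_dynamic:
                visit(linked_uri)
        # All links of this resource are handled; hand it to the indexer.
        push(uri)

    visit(start_uri)


# Toy in-memory site used in place of real HTTP requests (assumption).
SITE: Dict[str, List[Tuple[str, bool]]] = {
    "/": [("/docs", False), ("/search?q=x", True)],
    "/docs": [("/docs/views", False), ("/", False)],
    "/docs/views": [],
}

indexed: List[str] = []
crawl("/", lambda uri: SITE.get(uri, []), indexed.append)
print(indexed)
```

The `push` callback is where the inversion of control mentioned in the thread would happen: the crawler hands resources to the indexer instead of the indexer reading files from disk.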
> >
> > bye berni
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, email: [EMAIL PROTECTED]
>
>
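For points 2.1 to 2.3 of the Design-Draft quoted above, a small sketch of the field mapping might look like this. It uses only Python's standard `xml.etree.ElementTree` and builds a plain dictionary instead of a real Lucene document. The function name `extract_fields` and the `@` separator between element name and attribute name are my own assumptions; the draft only says the names are concatenated.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List


def extract_fields(xml_text: str) -> Dict[str, List[str]]:
    """Map an XML document to field-name -> values, following the draft."""
    fields: Dict[str, List[str]] = defaultdict(list)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # 2.1: one field per element name, holding the element's leading text.
        text = (element.text or "").strip()
        if text:
            fields[element.tag].append(text)
        # 2.2: one field per attribute, named element + separator + attribute.
        for name, value in element.attrib.items():
            fields[f"{element.tag}@{name}"].append(value)
    # 2.3: a "body" field with all bare text, whitespace-normalised.
    fields["body"].append(" ".join(" ".join(root.itertext()).split()))
    return dict(fields)


SAMPLE = (
    '<s1 title="The Views"><s2 title="Introduction">'
    "<p>Views are yet another sitemap component.</p></s2></s1>"
)
fields = extract_fields(SAMPLE)
print(fields)
```

Two caveats with this sketch: an element that is itself called `body` would collide with the bare-text field, and ElementTree reports namespaced elements as `{namespace-uri}localname`, so the open namespace question from the draft would show up directly in the field names.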
begin:vcard
n:Huber;Bernhard
fn:Bernhard Huber
version:2.1
email;internet:[EMAIL PROTECTED]
end:vcard