Re: Subject: Lucene as Avalon Component?

David Crossley Wed, 31 Oct 2001 23:39:15 -0800

Bernhard Huber wrote:
> hi, david
> thanks for the links to z39.50.
> I read a bit about that protocol, but as I understand
> supporting z39.50 might require writing an
> avalon block implementing the z39.50 server,
> that's at the moment a bit too much for me,
> learning avalon in depth + z39.50,
> anyway thanks!


It sure is big, Bernhard. I only meant to point out other
options so that cocoon-dev did not jump to any conclusions
about which software to use, and so that the design allows
other options to be integrated by Cocoon end-users.

These issues can be addressed, now that Stefano has
started a separate thread "[RT] semantic searching".

--David

> ----- Original Message -----
> From: David Crossley <[EMAIL PROTECTED]>
> Date: Monday, October 29, 2001 7:52 am
> Subject: Re: Subject: Lucene as Avalon Component?
> 
> > Structured searching is an obvious beneficiary of a solid
> > XML framework. Cocoon would need capability to allow
> > such functionality to be implemented by any search system
> > of choice.
> > 
> > I would prefer to utilise the Z39.50 protocol (ISO 23950).
> > This is stateful and session-based. It supports both fielded
> > and full-text search. It has a powerful boolean and relational
> > query syntax and various high-level abstractions.
> > 
> > Importantly, there are sets of well-known attributes which
> > shield the user from how the search is implemented and from
> > how the XML records are structured. (Bernhard, this directly
> > addresses your three numbered issues below.)
> > 
> > Of course, this power comes at the cost of potentially
> > complex implementation. However, this is eased by the
> > availability of solid toolkits and fully blown servers/gateways
> > (both open source and the other).
> > 
> > This is the age-old search and retrieve protocol from the
> > library world, so plenty of leverage can be gained.
> > Start at: http://lcweb.loc.gov/z3950/agency/
> > Also follow their links to resources/
> > I see there is at least one appropriate solution for Cocoon
> > which is open source and Java (JZKit).
> > 
> > Thanks Bernhard, for raising this important topic.
> > --David Crossley
> > 
> > Bernhard Huber wrote:
> > >  Stefano Mazzocchi wrote:
> > > > Bernhard, perfect timing! I was thinking about the same thing
> > > > the other day.
> > > > 
> > > > Bernhard Huber wrote:
> > > > > 
> > > > > hi,
> > > > > I'm taking a look at lucene, a nice search engine.
> > > > > As Cocoon2 claims to be an XML publishing engine,
> > > > > some sort of searching feature would be quite nice.
> > > > 
> > > > Yes, this is very true.
> > > > 
> > > > > Now I'm a bit confused how to make it usable under Cocoon2.
> > > > > Should I write a generator for the searching part of lucene?
> > > > > Should I encapsulate the indexing, and searching as
> > > > > an avalon component?
> > > > 
> > > > In a perfect world (but we aim for that, right?) we should have an
> > > > abstracted search engine behavioral interface (future compatible with
> > > > semantic capabilities?) and then have an Avalon component (block?) to
> > > > implement that.
> > > 
> > > and the search-engine understands your queries, semantically :-)
> > > But perhaps an advantage could be that a group of documents might
> > > already contain some semantic keywords, stored in the documents,
> > > like author and title.
> > > So searching for these keywords will give very good results.
> > > 
> > > > Then, a cocoon component (a generator or a transformer, depending on the
> > > > syntax of the query language being XML or not) can use the avalon
> > > > component to power itself and generate the XML event stream.
> > > 
> > > Yup, that would be nice.
> > > Moreover we can use the XML event stream not only for generating
> > > the answer to the search query/request, but also for evaluating some
> > > hit statistics, since the XML event stream can be handled like a
> > > static xml page source.
> > >
> > > > Note that both Lucene and dbXML (probably going to be called Apache
> > > > Xindice, from the latin word "indice" -> "index") could power this: the
> > > > first as an indexer of the textual part (final pipeline results) while
> > > > the second being an indexer of the semantic part (starting pipeline
> > > > sources).
> > > >
> > > > Obviously, a semantic approach is very likely to yield much better
> > > > results, but it requires a completely different way of doing search
> > > > (look at xyzsearch.com, for example), while lucene is simply doing
> > > > textual heuristics.
> > > I will try to check xyzsearch.com
> > > 
> > > But I have some troubles with "semantic".
> > > 
> > > As I would say "semantic" lies in the eye of the observer.
> > > But that's more philosophical.
> > > 
> > > Perhaps it would be interesting to gather some ideas,
> > > about what's the aim of using semantic search.
> > > 
> > > Although the simple textual search gives a lot of bad results,
> > > it is simple to use.
> > > 
> > > Using a semantic search should give better results, as the 
> > > elements are taken into account when generating an index,
> > > and when evaluating the result of a query.
> > > But some points to think about:
> > > 1. What should the user already know about the semantics of the
> > > documents?
> > >
> > > 2. Does he/she have to know that a document has an author, for example?
> > >
> > > 3. Does he/she have to know that entering the query
> > > "author:john" will search for the author's name?
> > >
> > > Perhaps all 3 issues are just a question of designing the GUI of
> > > a semantic search...
> > > 
> > > Just read now
> > > http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> > > I see, semantic means taking the xml elements into account.
> > > 
> > > > This said, it's also likely that the two approaches are so different
> > > > that a single behavioral interface will be either too general or too
> > > > simple to cover both cases, so, probably, both a textual search
> > > > interface and a markup search interface will be required.
> > > > 
> > > > > How should I index?
> > > > 
> > > > Eh, good question :)
> > > > 
> > > > My suggestion would be to connect the same xlink-based crawling
> > > > subsystem used for CLI to lucene as if it were a file system, but this
> > > > might require some Inversion of Control (us pushing files into lucene
> > > > and not lucene crawling them or reading them from disk) and thus some
> > > > code changes to it.
> > > I understand your hint. 
> > > I must admit that I never understood cocoon's view concept.
> > > Now I see what I can do using views.
> > > Perhaps adding an example in the view documentation, like
> > > Try using: 
> > > http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> > > http://localhost:8080/cocoon/welcome?cocoon-view=links
> > > would help a lot.
> > > But perhaps I'm just a bit slow....
> > > 
> > > I never intended to index the html result of a page,
> > >  but the xml content (ad fontes!).
> > > Thus I was thinking about how to index an xml source.
> > >
> > > Or, put more generally:
> > > What would be a smart xml indexing strategy?
> > >
> > > Let's take a snippet of
> > > http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> > >
> > > ----- begin
> > > .... 
> > > <s1 title="The Views">   
> > > <s2 title="Introduction">
> > > <p> Views are yet another sitemap component. Unlike the rest, they
> > >     are orthogonal to the resource and pipeline definitions. In the
> > > ...
> > > <s3 title="View Processing">   
> > > <p>The samples sitemap contains two view definitions. One of them
> > >      looks like the excerpt below.</p>
> > > <source xml:space="preserve">
> > > 
> > >   <map:views>
> > >      <map:view name="content" from-label="content">
> > >      <map:serialize type="xml"/>
> > >   </map:view>
> > > 
> > >      </source>
> > > ....
> > > ----- end
> > > 
> > > I see the following options:
> > > 1. Index only the bare text. That's simple, and stupid,
> > > as a lot of info entered by the xml generator (human, program)
> > > is ignored.
> > > 2. Try to take the element's name, and/or attributes into account.
> > > 3. Try to take the element's path into account.
> > > 
> > > Let's see what queries an engine should answer:
> > > ad 1. query: "Intro", result: all docs having text Intro
> > > 
> > > ad 2. query: "title:Intro", result: all docs having title elements with
> > > text Intro.
> > > 
> > > ad 2. query: "source:view", result: all docs having some source code
> > > snippet regarding cocoon view concept.
> > > 
> > > ad 3. query: "xpath:**/s2/title/Intro", result: all docs having s2 title
> > > Intro; I'm not sure how to marry lucene with xpath
> > > 
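The "field:term" queries listed above can be illustrated with a minimal Python sketch. This is not Lucene's query parser: the names parse_query, search, DEFAULT_FIELD and the DOCS sample are hypothetical, and plain substring matching stands in for real tokenized index lookups (which is why "Intro" matches "Introduction" here).

```python
from typing import Dict, List, Tuple

# Assumption: unfielded queries fall back to a catch-all "body" field.
DEFAULT_FIELD = "body"


def parse_query(query: str) -> Tuple[str, str]:
    """Split 'field:term' into (field, term); bare terms use DEFAULT_FIELD."""
    field, sep, term = query.partition(":")
    if sep and field and term:
        return field, term
    return DEFAULT_FIELD, query


def search(docs: Dict[str, Dict[str, List[str]]], query: str) -> List[str]:
    """Return the sorted names of documents whose field contains the term."""
    field, term = parse_query(query)
    needle = term.lower()
    return sorted(
        name
        for name, fields in docs.items()
        if any(needle in value.lower() for value in fields.get(field, []))
    )


# Hypothetical, hand-made index: document name -> field name -> values.
DOCS = {
    "views.html": {
        "title": ["Introduction"],
        "source": ['<map:view name="content" from-label="content">'],
        "body": ["Views are yet another sitemap component."],
    },
    "welcome.html": {
        "title": ["Welcome"],
        "body": ["Introduction to the Cocoon samples"],
    },
}

print(search(DOCS, "Intro"))        # searches the body field only
print(search(DOCS, "title:Intro"))  # searches the title field only
print(search(DOCS, "source:view"))  # searches the source field only
```

The xpath-style query (ad 3) is left out on purpose, since the thread itself leaves open how it would map onto fields.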
> > > > 
> > > > > Let's say I want to provide one or more sub-sitemaps
> > > > > a searching feature, and let's say the index is already
> > > > > generated, how can i calculate from the internal sitemap URL
> > > > > to public browser-URL?
> > > > > 
> > > > > For example I have an index over all /docs/samples/*/* files,
> > > > > how can I detect that they are all mapped to the URL
> > > > > http://machine/*/*?
> > > > >
> > > > > any ideas are welcome?
> > > > 
> > > > The CLI subsystem works by starting at a URI, asking for the "link" view
> > > > of that URI (cocoon will then return a newline-separated list of linked
> > > > URIs created out of all those links that contain xlink:href="" or src=""
> > > > or href="" attributes), then recursively calling itself on every linked
> > > > URI.
> > > >
> > > > When it reaches a leaf (a page with no further links or links that were
> > > > already visited), it asks for the "link-translated" view of the URI,
> > > > passing in POST to the request the newline-separated list of links so
> > > > that Cocoon knows how to regenerate an adapted version of the resource
> > > > (this is useful to maintain link consistency when moved on a file system
> > > > and working on the original link semantics, it works for every file
> > > > format, even for PDF, because link translation happens transparently
> > > > before serialization takes place).
> > > > 
> > > > Last operation is URI mangling where, depending on the given MIME-type of
> > > > the returned resource, the proper extension is added to the file name
> > > > and the resource is saved on disk.
> > > >
> > > > Another important feature is that the "link" view also indicates as
> > > > "dynamic" those links that have a particular xlink role (behavior)
> > > > xlink:role="dynamic", so they are skipped by the CLI generation and a
> > > > placeholder is written (that might redirect to the original URI, for
> > > > example).
> > > > 
> > > > So, currently, indexers like lucene assume that what goes out of a web
> > > > server is what is already in (at least, for static pages). Cocoon
> > > > doesn't work that way.
> > > >
> > > > So, the indexer should crawl from the end side (the web side, just like
> > > > big search engines do) and not assume anything about how the files are
> > > > generated internally.
> > > >
> > > > The only difference is that Cocoon implements a standard behavior of
> > > > resource views and we can use those to gain more information about the
> > > > requests without missing the semantic information that cocoon already
> > > > stores (such as the xlink information).
> > > > 
> > > > So, IMO, the most elegant and effective solution would be to connect
> > > > lucene to the cocoon view-based crawling subsystem:
> > > >
> > > > 1) start with some URI (the root, mostly)
> > > > 2) obtain the link view of the resource
> > > > 3) recursively call itself on non-dynamic links until a leaf is reached
> > > > 4) obtain the leaf resource (performing translation to adapt the
> > > > cocoon-relative URIs to the site-relative URIs) and push it into lucene
> > > > 5) continue until all leafs are processed.
> > > 
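The five crawling steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Cocoon's or Lucene's actual API: crawl, link_view, fetch, push and the in-memory SITE are hypothetical names, the link view is modelled as (uri, is_dynamic) pairs rather than a newline-separated list, and a resource is treated as a "leaf" once all of its links have been followed or skipped.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Link = Tuple[str, bool]  # (target URI, True if marked xlink:role="dynamic")


def crawl(start_uri: str,
          link_view: Callable[[str], Iterable[Link]],
          fetch: Callable[[str], str],
          push: Callable[[str, str], None]) -> List[str]:
    """Depth-first crawl; returns the URIs in the order they were pushed."""
    visited: Set[str] = set()
    pushed: List[str] = []

    def visit(uri: str) -> None:
        visited.add(uri)
        # Step 2: obtain the link view of the resource.
        for target, is_dynamic in link_view(uri):
            # Step 3: recurse on non-dynamic links not seen before.
            if is_dynamic or target in visited:
                continue
            visit(target)
        # Step 4: all links handled, so fetch the resource and push it
        # to the indexer (link translation is omitted in this sketch).
        push(uri, fetch(uri))
        pushed.append(uri)

    visit(start_uri)  # Step 1: start with some URI.
    return pushed     # Step 5: done once every reachable page is pushed.


# Hypothetical in-memory site standing in for Cocoon's "links" view.
SITE: Dict[str, List[Link]] = {
    "/": [("/docs", False), ("/search", True)],
    "/docs": [("/", False), ("/docs/views", False)],
    "/docs/views": [],
}

indexed: Dict[str, str] = {}
order = crawl("/",
              lambda uri: SITE.get(uri, []),
              lambda uri: "content of " + uri,
              indexed.__setitem__)
print(order)
```

Pushing each resource from the crawler, instead of letting the indexer read files itself, is the Inversion of Control mentioned earlier in the thread.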
> > > I will try to implement something like that...
> > > 
> > > Design-Draft
> > > 
> > > 1. Crawling:
> > >   Using the above described cocoon view-based crawling subsystem
> > > 
> > > 2. Indexing:
> > > 2.1 Each element-name will create a lucene field having the
> > >   same name as the element-name.
> > >   (What about the element's namespace, should I take it into account?)
> > >
> > > 2.2 Each attribute of an element will create a lucene field having
> > >   the concatenated name of the element-name, and the attribute-name.
> > > 2.3 Having a field named body for the bare text.
> > > 
> > > 3. Searching
> > >   Just use the lucene search engine.
> > > 
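The indexing part of the design draft (2.1 to 2.3) can be sketched with Python's standard xml.etree.ElementTree. This only shows how field names and values might be derived, not how Lucene stores them. Assumptions not in the draft: the function name extract_fields, the "@" separator between element name and attribute name, and using all descendant text as the value of an element's field. Namespaces are left as the parser reports them, since the draft leaves that question open.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List


def extract_fields(xml_text: str) -> Dict[str, List[str]]:
    """Map an XML document to field-name -> list-of-values."""
    fields: Dict[str, List[str]] = defaultdict(list)
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # 2.1: one field per element name, holding its whitespace-normalized text.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            fields[elem.tag].append(text)
        # 2.2: one field per attribute, named element-name + "@" + attribute-name.
        for name, value in elem.attrib.items():
            fields[elem.tag + "@" + name].append(value)
    # 2.3: a field named "body" for the bare text of the whole document.
    # Note: an element that is itself called "body" would share this field.
    body = " ".join("".join(root.iter().__next__().itertext()).split())
    if body:
        fields["body"].append(body)
    return dict(fields)


# Sample modelled on the views.html excerpt quoted earlier in the thread.
SAMPLE = (
    '<s1 title="The Views"><s2 title="Introduction">'
    "<p>Views are yet another sitemap component.</p>"
    "</s2></s1>"
)

fields = extract_fields(SAMPLE)
for name in sorted(fields):
    print(name, fields[name])
```

With this layout a query such as "s2@title:Introduction" would target attribute values, while an unfielded query would go to "body".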
> > > (btw,
> > > I was already playing with lucene for indexing/searching mail messages
> > > stored in mbox. This way I was searching the
> > > http://xml.apache.org/mails/200109.gz,
> > >
> > > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages.
> > > But that's a semantic problem, as the mail messages have poor
> > > xml-semantic content :-)
> > > )
> > >  
> > > > Note that "dynamic" has a different sense than before: it means that
> > > > the resource result is dependent on request-based or environmental
> > > > parameters (such as user-agent, date, time, machine load, IP address,
> > > > whatever). A resource that is produced by aggregating a ton of documents
> > > > stored in a database must be considered static if it is not dependent on
> > > > request parameters.
> > > > 
> > > > For a semantic crawler, instead of asking for the "standard" view, it
> > > > would ask for semantic-specific views such as "content" (the most
> > > > semantic stage at pipeline generation, which we already specify in our
> > > > example sitemaps) or "schema" (not currently implemented as nobody would
> > > > use it today anyway).
> > > > 
> > > > But the need of resource "views" is the key to the success of proper
> > > > search capabilities and we must be sure that we use them even for
> > > > semantically-poor searching solutions like lucene, but that would kick
> > > > ass anyway on small to medium size web sites.
> > > >
> > > > Hope this helps and if you have further questions, don't mind asking.
> > >
> > > thanks for your suggestions, helping a lot to understand cocoon better.
> > > 
> > > bye berni

