[resent since apparently it got lost somewhere]

Ciao,
Bernhard started a great thread about adding search capabilities with Lucene, but I'd love to give some more impressions on that.

Bernhard Huber wrote:

> > In a perfect world (but we aim for that, right?) we should have an
> > abstracted search engine behavioral interface (future compatible with
> > semantic capabilities?) and then have an Avalon component (block?) to
> > implement that.
>
> and the search-engine understands your queries, semantically :-)

Yeah, right :)

> But perhaps an advantage could be that a group of documents might
> present already some semantic keywords, stored in the documents,
> like author, and title.
> So searching for these keywords will give very good results.

I see several levels of search, from the least semantic to the most semantic:

1) regexp matching (i.e. grep): no semantics are associated with the search, since it's up to the user to perform the semantic analysis that leads to the creation of the regexp query to match. This results in boolean search (either it matches or it doesn't) and assumes the content is stored in textual formats.

2) text search engines (i.e. AltaVista): heuristics are used to extract sort-of semantic content from some known document types (mostly HTML) and associate some indexing value with them. This leads to an easier user experience.

3) metadata-based search engines (i.e. MetaCrawler): same as above, but using the <meta> HTML tag to associate keywords with a higher value in the search. Normally gives better results, even if sometimes keywords are misleading.

4) hyperlink-topology-based search engines (i.e. Google): they have the ability to estimate the importance of a page given the links that refer to it. Obviously, this can only happen when you have a "huge" pool of pages, as Google does. Note that Google is also able to parse and index PDF and extract heuristics from the internal graphics (font size, bold, italic and so on).

This is the state of the art. Google is, by far, the most advanced searching solution available, but due to its nature it cannot be applied to a small site without losing the power of topological analysis (thus, we go back to number 3).

Web crawlers are forced to obtain the web site information by "crawling" it from the outside, since they don't know the internals of the site. But local search solutions can have access to the web site from the backside and index it (see htdig, for example, or Oracle text search tools if text is stored in their databases).

All these solutions work as a restricted version of #3 above, but they are based on the assumption that the URI space can be easily mapped to the internal request. Apache might show you the opposite (at first!), but Cocoon shows this is very unlikely to be the case, since it's generally a mistake to map a file system (or a directory server, or a database repository) one-2-one onto the URI space: it leads to easily broken links and potential security issues.

This is why crawling is the only way to go. But since outside access reduces the visibility of some internal information that might increase the semantic capacity of the indexer, Cocoon provides "views" (you can think of them as "windows", but not in the M$ sense) onto the resources.

This said, we can now have access to the original content of the resource. For example, we can now index the text inside a logo, if we are given the SVG content that generated the raster image. Or we can index the PDF content without having to implement a PDF parser, since we request the "content" view of the resource and we obtain an easily parsable XML file.

Now, in a perfect world (again!), we could have a browser that allows us to add specific HTTP headers to the request; therefore, we could have Cocoon react to an HTTP header to know which view (also known as resource "variant" in the HTTP spec) was requested.

The current way for Cocoon to access views is fixed as a special URI query parameter "cocoon-view", but I think we should extend the feature to:

1) react on a "variant" HTTP header (nothing Cocoon-specific, since the concept could be implemented later on by other publishing frameworks)

2) react on the URI extension: for example http://host/path/file.view, which is something that I normally do by hand in my sitemaps (where http://host/path/index is the default resource and index.content is the XML view of the content)

3) react on a URI query parameter (as we do today).

You could suggest making this user-definable in the sitemap: well, while the views are user-definable (even if a number will be suggested as a solid contract to allow indexing of other Cocoons), I wouldn't like this to become too flexible, since this is a solid contract that, if broken, doesn't allow a crawler to obtain semantic information on a site it doesn't own.
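To make this concrete, here is a minimal sketch of what a view-aware crawler would do (untested, the URLs are fake, and note that the "variant" header is the *proposed* mechanism 1 above, not something Cocoon reacts to today; only the query parameter works):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ViewFetcher {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Mechanism 3: the query parameter Cocoon understands today.
            HttpRequest byParam = HttpRequest.newBuilder(
                    URI.create("http://host/path/index?cocoon-view=content"))
                .GET()
                .build();

            // Mechanism 1 (proposed, NOT implemented): ask for the view
            // via an HTTP header instead of polluting the URI space.
            HttpRequest byHeader = HttpRequest.newBuilder(
                    URI.create("http://host/path/index"))
                .header("variant", "content")
                .GET()
                .build();

            // Either way, the body is the easily parsable XML "content"
            // view, ready to be pushed into an indexer.
            HttpResponse<String> response =
                client.send(byParam, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }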
Ok, now, let us suppose we have our good Cocoon in place with a bunch of XML content and a way (through resource views) to obtain the most semantic version of this content. What can we do with it?

5) schema-based search engines: since markup is bidimensional (text + tags), we can now look for the text "text" inside the tag "tag". So, if you know the schema used (say, DocBook), you can place a query such as

  search for "cocoon" in elements "title|subtitle" of namespace
  "http://docbook.org/*" with xml:lang "undefined|EN"

which will return the documents that happen to have the text "cocoon" inside their "title" or "subtitle" elements associated with a namespace starting with the "http://docbook.org/" URL, and using the English language or having no language definition. I call this "schema-based", assuming that each schema has an associated namespace.

Note that this is also capable of performing metadata evaluation: a query such as

  search for "Stefano" and "Mazzocchi" in elements "author" of namespace
  "http://dublin-core.org/*"

will work on the metadata markup associated with the Dublin Core namespace.

Note also that, just like in many search engines, this is a very powerful syntax, but it's pretty unlikely that a user with no XML knowledge will be able to use it.
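For the curious, here is roughly how such a schema-based query could map onto a Lucene index where each element name becomes a field (a sketch only: the field names and the "namespace" field are my invention, and I'm leaving out the xml:lang filter):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class SchemaQuery {
        public static Query cocoonInTitles() {
            // "cocoon" must appear in either title or subtitle...
            BooleanQuery.Builder titles = new BooleanQuery.Builder();
            titles.add(new TermQuery(new Term("title", "cocoon")),
                       BooleanClause.Occur.SHOULD);
            titles.add(new TermQuery(new Term("subtitle", "cocoon")),
                       BooleanClause.Occur.SHOULD);

            // ...and the document must carry a DocBook-ish namespace,
            // matching the "http://docbook.org/*" wildcard above.
            BooleanQuery.Builder query = new BooleanQuery.Builder();
            query.add(titles.build(), BooleanClause.Occur.MUST);
            query.add(new PrefixQuery(new Term("namespace", "http://docbook.org/")),
                      BooleanClause.Occur.MUST);
            return query.build();
        }
    }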
There are possible ways of creating such a query, one being the one used on xyzsearch.com, which builds a complex schema-based query through an incremental process (they claim a patent on that, but you can patent a process, not an idea, and they don't have Cocoon views under their process):

a) you search for "Cocoon"

   Search for [Cocoon    ]   search | continue >>

b) it returns the list of schemas associated with the elements where the word Cocoon was found, along with a human-readable definition of each schema. For example:

   Markups where "Cocoon" was found:
   [ ] Zoological Markup Language
   [ ] Docbook
   [ ] Motion Pictures Description Language
   << back | search | continue >>

c) then you click on the markup you'd like to choose (hopefully understanding from the human description of the namespace what the language is about).

d) then it gives you the list of languages it was found in:

   Languages where the term "Cocoon" was found within markup "Docbook":
   [ ] undefined
   [ ] English (general)
   [ ] Italian
   << back | search | continue >>

e) then you click on the language and it asks you to indicate which tags you'd like:

   Contexts where the term "Cocoon" was found within markup "Docbook"
   and language "undefined" or "English":
   [ ] title : the title of the document
   [ ] subtitle : the subtitle of the document
   [ ] para : a paragraph
   [ ] strong : outlines important words
   << back | search | continue >>

And so on, until the user hits the "search" button and the list of results is presented.
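Behind such a UI there's nothing magic, just successive narrowing of the index entries; a hand-wavy sketch of the idea (the Posting record and its fields are entirely made up for illustration):

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class IncrementalRefinement {
        // One index entry: a term found in an element of some schema/language.
        record Posting(String term, String schema, String lang,
                       String element, String docUri) {}

        // Step b: which schemas was the term found in?
        static Set<String> schemas(List<Posting> index, String term) {
            return index.stream()
                        .filter(p -> p.term().equalsIgnoreCase(term))
                        .map(Posting::schema)
                        .collect(Collectors.toSet());
        }

        // Step d: which languages, once a schema is chosen?
        static Set<String> languages(List<Posting> index, String term,
                                     String schema) {
            return index.stream()
                        .filter(p -> p.term().equalsIgnoreCase(term))
                        .filter(p -> p.schema().equals(schema))
                        .map(Posting::lang)
                        .collect(Collectors.toSet());
        }

        // And so on: each click adds one more filter, until the user
        // hits "search" and the surviving docUris are returned.
    }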
In order to implement the above, we need:

a) a bunch of valid XML documents

b) a register of namespaces -> schemas, along with some human-readable descriptions of tags and schemas (which can be provided with the XMLSchema schema itself)

c) an XML-based storage system with advanced query capabilities (XPath or, even better, XQL)

d) a view-capable web publishing system

e) a view-based, schema-aware crawler and indexer

f) a web application that connects to the indexer and provides the above user experience.

These are all independent concern islands. The contracts are:

a) and b) are stored in c) (IMO, WebDAV or CVS would be the best contracts here, allowing editors to edit the files as if they were on a file system)

d) uses c) as a semi-structured data repository (the XMLDB API being the contract, or something equivalent)

e) uses d) to obtain the semantic content and index the site (HTTP and views being the contract)

f) uses e) to provide the search experience (no contract defined here; probably the software API or some general-enough searching API, maybe even Lucene's if powerful enough).

There is still a long way to go to have the entire system in place, but now that we have both a native XML DB and an indexing engine under Apache, I hope this is going to move faster. Of course, the editing part remains the most difficult one to solve :/

6) semantic search engines: if you are reading this far, I presume you'd consider #5 above a kick-ass search engine and would likely stop there. Well, there is more, and this is where the semantic web effort kicks in.

The previous technology (#5 from now onward) requires a bunch of software that is yet to be written, but it's very likely to happen. Or, at least, I don't see any technical or social reason why it should not. This, unfortunately, cannot be said for a semantic search engine (#6).

Let's start from outer space: you know what "semantic networks" are, right? They are also known as "topic maps" (see www.topicmaps.org for more details) and they represent a topological connection of "concepts", along with their relationships. The basic idea is the following:

1) suppose you have a bunch of semantically marked-up content

2) each important resource (not a web resource, but a semantic resource, i.e. a word) is properly described in absolute and unique terms; that is, currently, with an associated unique URI

3) there are semantic networks that describe the relationships between these resources.

With this infrastructure in place, it is virtually possible to use basic inference rules to "crawl" the semantic networks and obtain search derivatives which are semantically meaningful. Let's make an example:

1) suppose that your homepage states that you have two children: Bob and Susan. Bob is a 6-year-old boy and Susan is a 12-year-old girl. You are 42 and live in Boston.

2) suppose that you used proper markup (say RDF) to describe these relationships and you used the proper markup to indicate them.

3) now, a semantic crawler comes and indexes this information.

4) it is then virtually possible to ask for something like "give me the names of those men in Boston who have two or more children under 15" without requiring any heuristic artificial intelligence.

Now, in order to obtain this, we need:

a) the infrastructure of #5

b) a huge list of topics along with their unique meanings (unique in this case means that each topic, say "father", must have one and only one URI associated with it, say "http://www.un.org/topics/mankind/family/father"), or topic maps that state the formal equivalence of topics

c) topic maps that state the relationships between those topics

d) a way to create the query in a user-friendly way.

Well, given the political problems found in defining even the simplest B2B schema, I strongly doubt we'll ever come this far. And even if we do come this far and this huge semantic network gets implemented, the problem is making it possible (and profitable!) for authors to mark up their content in such a way that it is semantic-friendly in this topic-map sense. And given the number of people who think that M$ Word is the best authoring tool, well, authoring the information will surely be the worst part of both #5 and #6.

> But I have some troubles with "semantic".
>
> As I would say "semantic" lies in the eye of the observer.
> But that's more philosophical.

I hope the above explains better my meaning of "semantic".

> Perhaps it would be interesting to gather some ideas,
> about what's the aim of using semantic search.
>
> Although the simple textual search gives a lot of bad results,
> it is simple to use.

Correct. Both #5 and #6 might be extremely powerful but useless if people are unable to search due to usability complexity. In fact, the weak point of #5 (after talking with my girlfriend about it) is that people might believe it's broken, or that they did something wrong, if they don't see results but only a list of contexts to go further. Anyway, the above is just an example, not necessarily the best way to implement such a system.

> Using a semantic search should give better results, as the
> elements are taken into account when generating an index,
> and when evaluating the result of a query.

Well, not really. Suppose you don't go as far as stating that you want "Cocoon" inside the element "title". If you find "cocoon" in the HTML <title> you know this is better than finding "cocoon" in <p>, but what if you have a Chinese markup? How do you know?

So, I envision something like a heuristic map for tags and tag inclusions that states the relative value of finding a word in a particular location. So,

  para   -> 1
  strong -> 1
  title  -> 10

then

  /article/title/strong -> 10 + 1 = 11
  /para/strong          -> 1 + 1 = 2
  /section/title        -> 10

and so on, which might work for every markup and be general enough to allow the inclusion of namespaces and change the values depending on them.
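In code, this heuristic map could be as dumb as a weight table plus a sum over the element path; a quick sketch (weights and class names are made up, of course, and a real one would look up the table per namespace as just discussed):

    import java.util.Map;

    public class PathWeight {
        // Per-tag weights; in a real system these would be looked up
        // per namespace.
        private static final Map<String, Integer> WEIGHTS =
            Map.of("para", 1, "strong", 1, "title", 10);

        // The score of a path is the sum of the weights of its tags,
        // so /article/title/strong -> 10 + 1 = 11 (unknown tags count 0).
        static int score(String path) {
            int total = 0;
            for (String tag : path.split("/")) {
                total += WEIGHTS.getOrDefault(tag, 0);
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(score("/article/title/strong")); // 11
            System.out.println(score("/para/strong"));          // 2
            System.out.println(score("/section/title"));        // 10
        }
    }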
> But some points to think about:
> 1. What does the user already have to know about the semantics of the
> documents?

Exactly: he/she doesn't know, nor should he/she have to. This is what the heuristically associated tag values are for.

> 2. Does he/she have to know that a document has an author, for example?

Well, some metadata (like library indexes, for example) are very well established and might not confuse the user if presented in an advanced query form.

> 3. Does he/she have to know that querying for an author by entering
> "author:john" will search the author's name?

Absolutely not! This will be done by the web application.

> Perhaps all 3 issues are just a question of designing the GUI of
> a semantic search...

Yes and no. 3) calls for a better web app, that's for sure, but 1) IMO calls for a heuristic system that is currently hardwired to the HTML nature of web content, and which we have to abandon given the flexibility of the XML model.

> Just read now
> http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> I see, semantic is taking the xml elements into account.

Yes, more or less this is the meaning I give to the word.

> > > How should I index?
> >
> > Eh, good question :)
> >
> > My suggestion would be to connect the same xlink-based crawling
> > subsystem used for the CLI to Lucene as if it were a file system,
> > but this might require some Inversion of Control (us pushing files
> > into Lucene and not Lucene crawling them or reading them from disk),
> > thus some code changes to it.
>
> I understand your hint.

Great!

> I must admit that I never understood cocoon's view concept.

Very few do. In fact, even Giacomo didn't understand views at first when he implemented the sitemap, and they are still left in an unknown state. I hope to be able to provide some docs to shed light on this soon.

> Now I see what I can do using views.

Yes; without views, Cocoon would only be harmful for the semantic web effort (see a pretty old RT, "is Cocoon harmful for the semantic web", on this list, also picked up by xmlhack.com).

> Perhaps adding an example in the view documentation, like
> Try using:
> http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> http://localhost:8080/cocoon/welcome?cocoon-view=links
> would help a lot.
> But perhaps I'm just a bit slow....

No, don't worry, the concepts are pretty deep in the abstract reasoning of how a web should work in the future, and there are no docs explaining this.

> I never supposed to index the html result of a page,
> but the xml content (ad fontes!).
> Thus I was thinking about how to index an xml source.
>
> Or saying it more generally:
> What would be a smart xml indexing strategy?

Ok, second step: the indexing algorithm. Warning: I know nothing of text indexing nor of the algorithms associated with these problems!

> Lets take a snippet of
> http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
>
> ----- begin
> ....
> <s1 title="The Views">
>  <s2 title="Introduction">
>   <p>Views are yet another sitemap component. Unlike the rest, they
>   are orthogonal to the resource and pipeline definitions. In the
>   ...
>  <s3 title="View Processing">
>   <p>The samples sitemap contains two view definitions. One of them
>   looks like the excerpt below.</p>
>   <source xml:space="preserve">
>    <map:views>
>     <map:view name="content" from-label="content">
>      <map:serialize type="xml"/>
>     </map:view>
>    </map:views>
>   </source>
> ....
> ----- end
>
> I see the following options:
> 1. Index only the bare text. That's simple, and stupid,
> as a lot of info entered by the xml generator (human, program)
> is ignored.

Yes. It's already powerful, as we would be able, for example, to index picture text out of SVG files, or PDF files without requiring PDF parsing, but it is admittedly a waste of precious information. It could be a first step, though.

> 2. Try to take the element's name, and/or attributes into account.
> 3. Try to take the element's path into account.
I would suggest taking the heuristic value of the path into account, rather than the path itself.

> Let's see what queries an engine should answer:
>
> ad 1. query: "Intro", result: all docs having the text Intro
>
> ad 2. query: "title:Intro", result: all docs having title elements with
> text Intro.
>
> ad 2. query: "source:view", result: all docs having some source code
> snippet regarding the cocoon view concept.
>
> ad 3. query: "xpath:**/s2/title/Intro", result: all docs having s2 title
> Intro; not sure about this, how to marry lucene with xpath

I don't know the internals of Lucene, but maybe associating some numerical values with the text would be useful to improve the ranking by importance. Well, maybe we should ask the Lucene guys about this.

> I will try to implement something like that...
>
> Design-Draft
>
> 1. Crawling:
>    Using the above described cocoon view-based crawling subsystem
>
> 2. Indexing:
> 2.1 Each element-name will create a lucene field having the
>     same name as the element-name.
>     (?What about the element's namespace, should I take it into account?)

Yes, it should identify the schema used to get the heuristic mapping. Also, there could be mixed heuristic mappings, for example between the DocBook namespace and the Dublin Core namespace.

> 2.2 Each attribute of an element will create a lucene field having
>     the concatenated name of the element-name and the attribute-name.
> 2.3 Having a field named body for the bare text.
>
> 3. Searching
>    Just use the lucene search engine.

I think this is a good starting point, yes.
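For the archives, here is roughly how I'd picture points 2.1-2.3 as a SAX handler feeding Lucene: an untested sketch, where the field layout follows your draft and everything else is made up (a real version would keep a stack of open elements, and the heuristic path weights above could then be applied, e.g. as boosts):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Receives the SAX events of a "content" view pushed by the crawler
    // (the Inversion of Control mentioned earlier) and builds one Lucene
    // Document per resource.
    public class IndexingHandler extends DefaultHandler {
        private final Document doc = new Document();
        private String currentElement = "body";

        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            currentElement = localName;
            // 2.2: element-name + attribute-name becomes a field.
            for (int i = 0; i < atts.getLength(); i++) {
                doc.add(new TextField(localName + "@" + atts.getLocalName(i),
                                      atts.getValue(i), Field.Store.NO));
            }
        }

        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length);
            // 2.1: one field per element name...
            doc.add(new TextField(currentElement, text, Field.Store.NO));
            // 2.3: ...plus everything in a catch-all "body" field.
            doc.add(new TextField("body", text, Field.Store.NO));
        }

        public Document getDocument() {
            return doc;
        }
    }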
> (btw,
> I was already playing with lucene for indexing/searching mail messages
> stored in mbox. This way I was searching
> http://xml.apache.org/mails/200109.gz.
>
> Wouldn't it be nice to generate FAQs, etc. from the mbox mail messages?
> But that's a semantic problem, as the mail messages have poor
> xml-semantic content :-)

Yes, even if, in theory, we all use things like *STRONG* _emphasis_ LOUD "quote" and the like. This is, in fact, markup in the most general sense :)

> > Note that "dynamic" has a different sense than before and it means
> > that the resource result is not dependent on request-based or
> > environmental parameters (such as user-agent, date, time, machine
> > load, IP address, whatever). A resource that is created by
> > aggregating a ton of documents stored in a database must be
> > considered static if it is not dependent on request parameters.
> >
> > For a semantic crawler, instead of asking for the "standard" view,
> > it would ask for semantic-specific views such as "content" (the most
> > semantic stage of pipeline generation, which we already specify in
> > our example sitemaps) or "schema" (not currently implemented, as
> > nobody would use it today anyway).
> >
> > But the need for resource "views" is the key to the success of
> > proper search capabilities, and we must be sure that we use them
> > even for semantically-poor searching solutions like Lucene, which
> > would kick ass anyway on small to medium size web sites.
> >
> > Hope this helps and if you have further questions, don't mind asking.
>
> thanks for your suggestions, helping a lot to understand cocoon better.

Hope this helps even more :)

Ciao.

--
Stefano Mazzocchi      One must still have chaos in oneself to be
<[EMAIL PROTECTED]>     able to give birth to a dancing star.
                                                 Friedrich Nietzsche