I've looked at Mark's concept and code, and, IMHO, his implementation is
well-done and addresses a huge need. It allows you to conduct Lucene
searches that can harness all the power of the latest Query objects,
without any special Java coding. Yet it also allows the user to be
presented with a simple query interface.
It is not, as Mark indicates, an effort to form a generic, standardized
query language, and should not be confused as such. Perhaps the term
"Advanced" query language might be changed to more accurately reflect
its function (though at the moment my brain isn't coming up with any
good alternative descriptions). Maybe "dynamic query configuration
mechanism" or something.
--Terry
mark harwood wrote:
> However the moment you are promoting INTEROPERABILITY with other
> search/retrieval systems by XMLizing the query input and the result
> output, like Mark is, then it makes sense to adhere to standards
I think this is hijacking my original intentions to
some extent. I may be accused of being short-sighted
but I wasn't proposing a language for interoperability
with other search systems or query standards. That
approach suggests a constraining "lowest common
denominator" effect which is at odds with my original
intentions.
What I was looking for was simply a way to fill the
gap between the current QueryParser syntax and the
growing list of powerful Lucene features that can't be
represented in its syntax (spans, regexes,
FilteredQuery, LikeThis....).
I outlined why a String representation of queries was
desirable in my original post (fundamentally:
persistence, distribution and language independence).
My use of XML was intended to meet the above
objectives and give FULL coverage of all Lucene
features. Search system interoperability wasn't on my
list and (correct me if I'm wrong here) adding it
would preclude some of the more exotic Lucene features
e.g. "MoreLikeThis" or "BoostingQuery".
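
To make the three objectives concrete, here is a minimal sketch of what a string (XML) form of a query allows: it can be stored, sent elsewhere, and read back from any language. The element and attribute names (`BooleanQuery`, `Clause`, `TermQuery`, `SpanNear`, `slop`, `occurs`) are invented for illustration and are not the syntax of the actual proposal; Python is used only to underline the language-independence point.

```python
import xml.etree.ElementTree as ET

# Hypothetical query document; tag and attribute names are illustrative only.
QUERY_XML = """
<BooleanQuery>
  <Clause occurs="must">
    <TermQuery field="title">lucene</TermQuery>
  </Clause>
  <Clause occurs="should">
    <SpanNear field="body" slop="3">
      <SpanTerm>query</SpanTerm>
      <SpanTerm>parser</SpanTerm>
    </SpanNear>
  </Clause>
</BooleanQuery>
"""


def to_tree(element):
    """Convert an XML element into a plain nested dict (a neutral query tree)."""
    node = {"type": element.tag, "attrs": dict(element.attrib)}
    text = (element.text or "").strip()
    if text:
        node["text"] = text
    children = [to_tree(child) for child in element]
    if children:
        node["children"] = children
    return node


def parse_query(xml_string):
    """Parse the stored string form back into a query tree."""
    return to_tree(ET.fromstring(xml_string))


query_tree = parse_query(QUERY_XML)
# Persistence / distribution: serialise, then re-read without losing structure.
serialised = ET.tostring(ET.fromstring(QUERY_XML), encoding="unicode")
round_tripped = parse_query(serialised)
```

A real implementation would map each node of such a tree onto the corresponding Lucene Query object; that step is deliberately left out here.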
I'm all for standards/interoperability in the new
query language if it:
a) Doesn't become a nightmare to implement
b) Allows all of the Lucene query functionality to be
exposed
c) Is a real requirement for enough Lucene users
I'm just not sure that any/all of these conditions are
true.
Maybe there needs to be a separate "interoperability"
language development?
Cheers
Mark
--- Joaquin Delgado <[EMAIL PROTECTED]>
wrote:
Comments in-line
Wolfgang Hoschek wrote:
Yes, there are interesting impls out there. I've myself implemented
XQuery fulltext search via extension functions built on Lucene. See
http://dsd.lbl.gov/nux/index.html#Google-like%20realtime%20fulltext%20search%20via%20Apache%20Lucene%20engine
However, rather than targeting fulltext search of infrequent queries
over huge persistent data archives (historic search), Nux targets
streaming fulltext search of huge numbers of queries over
comparatively small transient realtime data (prospective search),
e.g. 100000 queries/sec ballpark. Think XML router. That's probably
distinctly different from what many (most?) other folks would like to
do, and requires a different, somewhat non-standard, architecture.
[The underlying lucene code lives in lucene SVN in
the lucene/contrib/
memory module, the remainder is in Nux.]
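
As a rough illustration of the historic/prospective distinction (and not Nux's or Lucene's actual implementation), the sketch below registers a set of standing queries once and then tests each incoming transient document against all of them. The class and method names, and the rule that every query term must occur in the document, are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


class ProspectiveMatcher:
    """Holds many standing queries; each arriving document is matched against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, terms):
        self.queries[query_id] = tokenize(terms)

    def match(self, document):
        """Return the sorted ids of all standing queries whose terms all occur."""
        tokens = tokenize(document)  # tiny throwaway "index" for this one document
        return sorted(qid for qid, terms in self.queries.items() if terms <= tokens)


matcher = ProspectiveMatcher()
matcher.register("q1", "lucene xquery")
matcher.register("q2", "oracle")
hits = matcher.match("XQuery fulltext search built on Lucene")
```

Here only `q1` matches the sample document. The point is the inversion of roles: the queries are the long-lived data, and the per-document token set is built and discarded for each message.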
Implementing XQuery in full compliance with the spec is a rather
gigantic undertaking. Separating the XQuery language and the fulltext
language greatly simplified the system design, and made it more
flexible and extensible.
[JOAQUIN] One of the arguable advantages of this new XQuery FT draft is
that the semantics (http://www.w3.org/TR/xquery-full-text/#tq-semantics)
are defined using XQuery functions, thus it is relatively easy to build
a "dumb" XQuery-FT compliant engine using these definitions :-) Here is
a Java-based XQuery engine developed at Cornell that satisfies most of
the working draft's requirements:
http://www.cs.cornell.edu/database/Quark/quark_main.html
Further, consider that fulltext search capabilities are typically
quite open ended and context/application specific. Seems to me that
that's one of the reasons why lucene is more a set of interfaces and
diverse building blocks than a complete end user system. I find it
difficult to believe that making the fulltext language an *integral
part of XQuery* will enable sufficient "extension points" to prove
meaningful to end users and implementors. Standards evolve at a
glacial pace; it effectively means that most or all flexibility is
lost. I tend to think that the W3C is jumping the gun and attempting
to standardize what is more an R&D concept than a well understood set
of capabilities across a wide range of actual real world use cases,
and it does so in a non-modular manner.
Full-text search remains open ended and context/app specific, thus it
makes sense to leave Lucene as is and still have, for example, Nutch.
However, the moment you are promoting INTEROPERABILITY with other
search/retrieval systems by XMLizing the query input and the result
output, like Mark is, then it makes sense to adhere to standards, and the
standard to query XML is XQuery. Because of the nature of the data (XML),
full-text becomes a *must* requirement of the standard. If Mark comes up
with yet another query language with some custom tags, it would be
denying the fact that search systems need to communicate among themselves,
and thus re-inventing the wheel. Besides, almost 80% of all full-text
operators (Boolean, wildcards, proximity, etc.) just differ in syntax
from one search engine to another. Just look at another "Common Query
Language" now being used by the Library of Congress
(http://www.loc.gov/standards/sru/cql/) for federated search.
Maybe I'm being too ambitious here, but if we have an implementation of
an XQuery-FT compliant XQuery engine on top of Lucene indices, or at the
minimum if _Lucene could interpret XPath queries_ where element node labels
are equivalent to Lucene fields, we can begin thinking of exposing Lucene
sources to more sophisticated and distributed XQuery engines, thus
providing full XML support on any Lucene based system. Unfortunately,
Lucene does not support nested fields, but that is OK for now.
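
A minimal sketch of the mapping suggested here, treating the last element name in a path as a Lucene field name. It recognises only one XPath shape, `path[contains(., 'term')]`, and emits the ordinary `field:term` QueryParser form; earlier path steps are simply dropped, which mirrors the lack of nested fields. The function name and the supported subset are assumptions for illustration, not an existing Lucene feature.

```python
import re

# Only this one XPath shape is recognised: /a/b/field[contains(., 'term')]
_PATTERN = re.compile(
    r"""^/{1,2}(?:[\w-]+/)*(?P<field>[\w-]+)   # path steps; last one is the field
        \[contains\(\.\s*,\s*(?P<q>['"])(?P<term>[^'"]+)(?P=q)\)\]$""",
    re.VERBOSE,
)


def xpath_to_field_query(xpath):
    """Translate a very small XPath subset into a 'field:term' query string."""
    m = _PATTERN.match(xpath.strip())
    if m is None:
        raise ValueError("unsupported XPath expression: " + xpath)
    term = m.group("term")
    # Quote multi-word terms so they are read as a phrase.
    if " " in term:
        term = '"' + term + '"'
    return m.group("field") + ":" + term


simple = xpath_to_field_query("/doc/title[contains(., 'lucene')]")
phrase = xpath_to_field_query('//body[contains(., "query parser")]')
```

Anything outside this subset (predicates on attributes, axes, nested conditions) raises `ValueError` rather than being silently mistranslated.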
-- Joaquin
On Dec 17, 2005, at 5:43 PM,
[EMAIL PROTECTED] wrote:
Paul and Wolfgang,
Thank you very much for your input. I think there
are two distinct
problems that have emerged from this thread:
1) The ability to create efficient structures to index and query XML
documents (elements, attributes and corresponding values) with a
full-text query language and operators. After all, XML is text. As
Paul pointed out, people have already tried this with Lucene.
2) The need for a standard query language like XQuery aiming at
system interoperability in the now XMLized world, with the same
effect that SQL had in the relational world.
While I can see how in the SQL case extension functions can be used
to implement full-text capabilities, in the XML case full-text is
required to query and retrieve XML (sub-document) elements and
attributes based on the free text (natural language) values AND
also to query the strings that represent the structure itself. For
example, in simple SQL queries the names of the tables and columns
need to be known to project corresponding values and are not part of
the search conditions (in WHERE clauses only values corresponding to
table/columns are evaluated).
In XQuery both the structure and the content are searchable, thus
requiring full-text operators. That is why XQuery Full-Text requires
the unification and standardization of both the XQuery and Full-Text
"languages". Needless to say, the implementation will differ
from system to system.
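
The contrast with SQL can be sketched as follows: in the XML case a search term may hit element or attribute names (structure) as well as text values (content). The example uses only Python's standard `xml.etree.ElementTree`; the sample document and the `search` helper are invented for illustration and say nothing about how an XQuery Full-Text engine would actually implement this.

```python
import xml.etree.ElementTree as ET

# Made-up sample document.
SAMPLE = """
<book>
  <title>Search basics</title>
  <author role="editor">A. Writer</author>
  <summary>Covers the title search API</summary>
</book>
"""


def search(root, term):
    """Return (kind, element tag) pairs where term occurs in structure or content."""
    term = term.lower()
    hits = []
    for elem in root.iter():
        if term in elem.tag.lower():
            hits.append(("structure", elem.tag))
        if any(term in name.lower() for name in elem.attrib):
            hits.append(("structure", elem.tag))
        if term in (elem.text or "").lower():
            hits.append(("content", elem.tag))
    return hits


root = ET.fromstring(SAMPLE)
title_hits = search(root, "title")
role_hits = search(root, "role")
```

Searching for "title" finds both the `title` element (a structure hit) and the `summary` text that mentions it (a content hit), which is the case a SQL WHERE clause over known columns does not cover.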
I do agree though that the abstraction of full-text capabilities
through functional extensions is a great first step. Check out
Oracle's XML Query Service
(http://www.oracle.com/technology/tech/xml/xds/index.html and
http://www.oracle.com/technology/oramag/oracle/05-mar/o25xml.html), a
Java-based XQuery engine that has abstracted "data sources" such as Web
Services, RDBMS, etc. as functions that, while returning XML, can receive
parameters and supply full-text capabilities. If Mark's implementation
of Lucene query and output in XML comes to fruition, a Lucene data
source will become yet another stream of XML that can be queried,
processed and rendered by the mid-tier XQuery engine.
-- Joaquin
While maintaining my bookmarks I ran into this:
=== message truncated ===
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]