Re: "Advanced" query language

Joaquin Delgado Mon, 19 Dec 2005 16:45:40 -0800

Comments in-line

Wolfgang Hoschek wrote:

Yes, there are interesting impls out there. I've myself implemented XQuery fulltext search via extension functions built on Lucene. See http://dsd.lbl.gov/nux/index.html#Google-like%20realtime%20fulltext%20search%20via%20Apache%20Lucene%20engine
However, rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), Nux targets streaming fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search), e.g. 100000 queries/sec ballpark. Think XML router. That's probably distinctly different from what many (most?) other folks would like to do, and requires a different, somewhat non-standard, architecture.
[The underlying Lucene code lives in Lucene SVN in the lucene/contrib/memory module; the remainder is in Nux.]
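
A minimal sketch of the prospective-search pattern described above: each small, transient document is indexed in memory once and then matched against many standing queries. This is not the Nux or Lucene API; the `TransientIndex` class, the `route` function and the query names are hypothetical, and queries are simplified to "all terms must occur".

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TransientIndex:
    """Hypothetical throwaway index over a single small document."""

    def __init__(self, text):
        # Term -> frequency for this one document only.
        self.term_counts = Counter(tokenize(text))

    def matches(self, required_terms):
        # Simplified query model (an assumption): every term must occur.
        return all(term in self.term_counts for term in required_terms)


def route(document_text, standing_queries):
    """Return the names of all standing queries the document satisfies."""
    index = TransientIndex(document_text)  # built once per document
    return [name for name, terms in standing_queries.items()
            if index.matches(terms)]


# Standing queries are registered up front; terms are given in lowercase.
standing_queries = {
    "lucene-news": ["lucene", "release"],
    "xquery-news": ["xquery"],
}

matched = route("Apache Lucene announces a new release", standing_queries)
print(matched)  # ['lucene-news']
```

The point of the design is that the index is cheap to build and discard, so the cost per incoming document is dominated by evaluating the standing queries rather than by persistence.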
Implementing XQuery in full compliance with the spec is a rather gigantic undertaking. Separating the XQuery language and the fulltext language greatly simplified the system design, and made it more flexible and extensible.

[JOAQUIN] One of the arguable advantages of this new XQuery FT draft is that the semantics (http://www.w3.org/TR/xquery-full-text/#tq-semantics) are defined using XQuery functions, thus it is relatively easy to build a "dumb" XQuery-FT compliant engine using these definitions :-) Here is a Java-based XQuery engine developed at Cornell that satisfies most of the working draft's requirements:

http://www.cs.cornell.edu/database/Quark/quark_main.html

Further, consider that fulltext search capabilities are typically quite open ended and context/application specific. Seems to me that that's one of the reasons why Lucene is more a set of interfaces and diverse building blocks than a complete end user system. I find it difficult to believe that making the fulltext language an *integral part of XQuery* will enable sufficient "extension points" to prove meaningful to end users and implementors. Standards evolve at a glacial pace; it effectively means that most or all flexibility is lost. I tend to think that the W3C is jumping the gun and attempting to standardize what is more an R&D concept than a well understood set of capabilities across a wide range of actual real world use cases, and it does so in a non-modular manner.

[JOAQUIN] Full-text search remains open ended and context/app specific, thus it makes sense to leave Lucene as is and still have, for example, Nutch. However, the moment you are promoting INTEROPERABILITY with other search/retrieval systems by XMLizing the query input and the result output, like Mark is, it makes sense to adhere to standards, and the standard to query XML is XQuery. Because of the nature of the data (XML), full-text becomes a *must* requirement of the standard. If Mark comes up with yet another query language with some custom tags, it would be denying the fact that search systems need to communicate among themselves, and thus re-inventing the wheel. Besides, almost 80% of all full-text operators (Boolean, wildcards, proximity, etc.) just differ in syntax from one search engine to another. Just look at another "Common Query Language" now being used by the Library of Congress (http://www.loc.gov/standards/sru/cql/) for federated search.

Maybe I'm being too ambitious here, but if we have an implementation of an XQuery-FT compliant XQuery engine on top of Lucene indices, or at the minimum _Lucene could interpret XPath queries_ where element node labels are equivalent to Lucene fields, we can begin thinking of exposing Lucene sources to more sophisticated and distributed XQuery engines, thus providing full XML support on any Lucene-based system. Unfortunately, Lucene does not support nested fields, but that is OK for now.
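
One way to picture "element node labels are equivalent to Lucene fields" is to flatten an XML document into a flat map of field names to text values. The sketch below is only an illustration under stated assumptions: it does not call Lucene, the function name `flatten_to_fields` is hypothetical, and using the full element path as the field name is one possible workaround for the lack of nested fields, not something proposed in this thread. Attributes and tail text are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten_to_fields(xml_text):
    """Map each element path to the list of text values found there."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)

    def walk(element, parent_path):
        # Assumption: the full path doubles as a flat field name,
        # because nested fields are not available.
        path = f"{parent_path}/{element.tag}"
        text = (element.text or "").strip()
        if text:
            fields[path].append(text)
        for child in element:
            walk(child, path)

    walk(root, "")
    return dict(fields)


doc = ("<book><title>Lucene in Action</title>"
       "<author><name>Erik</name></author></book>")
fields = flatten_to_fields(doc)
print(fields)
# {'/book/title': ['Lucene in Action'], '/book/author/name': ['Erik']}
```

With such a mapping, a simple path expression like /book/author/name reduces to a lookup on a single flat field, which is roughly the minimum needed before a field-based engine could answer it.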


-- Joaquin


On Dec 17, 2005, at 5:43 PM, [EMAIL PROTECTED] wrote:

Paul and Wolfgang,

Thank you very much for your input. I think there are two distinct problems that have emerged from this thread:

1) The ability to create efficient structures to index and query XML documents (elements, attributes and corresponding values) with a full-text query language and operators. After all, XML is text. As Paul pointed out, people have already tried this with Lucene.

2) The need for a standard query language like XQuery aiming at system interoperability in the now XMLized world, with the same effect that SQL had in the relational world.

While I can see how in the SQL case extension functions can be used to implement full-text capabilities, in the XML case full-text is required to query and retrieve XML (sub-document) elements and attributes based on the free text (natural language) values AND also to query the strings that represent the structure itself. For example, in simple SQL queries the names of the tables and columns need to be known to project corresponding values and are not part of the search conditions (in WHERE clauses only values corresponding to table/columns are evaluated).

In XQuery both the structure and the content are searchable, thus requiring full-text operators. That is why XQuery Full-Text requires the unification and standardization of both the XQuery and Full-Text "languages". Needless to say, the implementation will differ from system to system.
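
To make the structure-versus-content point concrete, here is a small sketch in which a single keyword is matched against both element names and element text. It uses only the Python standard library and plain substring matching; it is not XQuery Full-Text, and the function name `search_structure_and_content` is hypothetical.

```python
import xml.etree.ElementTree as ET


def search_structure_and_content(xml_text, keyword):
    """Return (kind, tag) pairs where the keyword matches a tag or its text."""
    keyword = keyword.lower()
    hits = []
    for element in ET.fromstring(xml_text).iter():
        # Structure match: the keyword occurs in the element name itself.
        if keyword in element.tag.lower():
            hits.append(("structure", element.tag))
        # Content match: the keyword occurs in the element's own text.
        if keyword in (element.text or "").lower():
            hits.append(("content", element.tag))
    return hits


doc = ("<library><author>Jane Author</author>"
       "<title>On Titles</title></library>")
print(search_structure_and_content(doc, "author"))
# [('structure', 'author'), ('content', 'author')]
```

In the simple SQL case described above, only the second kind of match would be part of the search condition, since table and column names have to be known in advance.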

I do agree though that the abstraction of full-text capabilities through functional extensions is a great first step. Check out Oracle's XML Query Service (http://www.oracle.com/technology/tech/xml/xds/index.html and http://www.oracle.com/technology/oramag/oracle/05-mar/o25xml.html), a Java-based XQuery engine that has abstracted "data sources" such as Web Services, RDBMS, etc. as functions that, while returning XML, can receive parameters and supply full-text capabilities. If Mark's implementation of Lucene query and output in XML comes to fruition, a Lucene data source will become yet another stream of XML that can be queried, processed and rendered by the mid-tier XQuery engine.


-- Joaquin



While maintaining my bookmarks I ran into this:
"Case Study: Enabling Low-Cost XML-Aware Searching
Capable of Complex Querying":

http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html


Some loose thoughts:

In the system described there a Lucene document is used for each
low level xml construct, even when it contains very few characters
of text.
The resulting Lucene indexes are at least 2.5 times the size of the
original document, which is not a surprise given this document
structure.
Normal index size is about one third of the indexed text.

I don't know about the XQuery standard, but I was wondering
whether this unusual document structure and the non straightforward
fit between Lucene queries and XQuery queries are related.

As for the joins and iterations over items from the stream of XML
results: iteration over matching XML constructs should be no problem
in Lucene. Joins in Lucene are normally done via boolean filters,
so I was wondering how XQuery joins fit these.
The case study above has a note at the end of par. 5.3:
"The Search Result list that comes back could then be organized
by document id to group together all the results for a single XML
document. This is not provided by default, but has been done with
extension to this code."
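
The grouping step mentioned in that note can be sketched as follows. This is an assumption about what such an extension might do, not the case study's code: each hit is taken to be a hypothetical (document id, element path, score) tuple, and hits are simply collected per document.

```python
from collections import defaultdict


def group_hits_by_document(hits):
    """Group per-element hits under the XML document they came from."""
    grouped = defaultdict(list)
    for doc_id, element_path, score in hits:
        grouped[doc_id].append((element_path, score))
    return dict(grouped)


# Hypothetical hits: one entry per matching low-level XML construct.
hits = [
    ("doc1", "/article/title", 0.9),
    ("doc2", "/article/body/para", 0.7),
    ("doc1", "/article/body/para", 0.4),
]

grouped = group_hits_by_document(hits)
print(grouped["doc1"])
# [('/article/title', 0.9), ('/article/body/para', 0.4)]
```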

Regards,
Paul Elschot

On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote:

I think implementing an XQuery Full-Text engine is far beyond the
scope of Lucene.

Implementing a building block for the fulltext aspect of it would be
more manageable. Unfortunately, the W3C fulltext drafts
indiscriminately mix and mingle two completely different languages
into a single language, without clear boundaries. That's why most
practical folks implement XQuery fulltext search via extension
functions rather than within XQuery itself. This also allows for much
more detailed tokenization, configuration and extensibility than what
would be possible with the W3C draft.

Wolfgang.

On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote:

Mark,

This is very cool. When I was at TripleHop we did something very
similar where both query and results conformed to an XML Schema and
we used XML over HTTP as our main vehicle to do remote/federated
searches with quick rendering with stylesheets.

That however is the first piece of the puzzle. If you really want
to go beyond search (in the traditional sense) and be able to
perform more complex operations such as joins and iterations over
items from the stream of XML results you are getting you should
consider implementing an XQuery Full-Text engine with Lucene
adopting the now standard XQuery language.

Here is the pointer to the W3C working draft on XQuery 1.0 and
XPath 2.0 Full-Text:
http://www.w3.org/TR/xquery-full-text/

Now I'm part of the task force editing this draft so your comments
are very much welcomed.

-- J.D.


http://www.inperspective.com/lucene/LXQueryV0_1.zip

I've implemented just a few queries (Boolean, Term, FilteredQuery,
BoostingQuery ...) but other queries are fairly trivial to add.
At this stage I am more interested in feedback on parser design/approach
rather than trying to achieve complete coverage of all the Lucene Query
types or debating the choice of tag names.
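
For readers without the package at hand, one common parser design for an XML query syntax is a table of builder functions keyed by tag name, so that adding a query type means adding one function. The sketch below illustrates that general approach only; it is not the code in the zip file, the tag names (`BooleanQuery`, `Clause`, `TermQuery`) and attributes are hypothetical, and queries are built as plain tuples rather than Lucene Query objects.

```python
import xml.etree.ElementTree as ET


def build_term(element):
    # Hypothetical syntax: <TermQuery field="title">lucene</TermQuery>
    return ("term", element.get("field"), (element.text or "").strip())


def build_boolean(element):
    # Hypothetical syntax: each <Clause occurs="..."> wraps one sub-query.
    clauses = []
    for clause in element.findall("Clause"):
        children = list(clause)
        if len(children) != 1:
            raise ValueError("a Clause must contain exactly one query")
        clauses.append((clause.get("occurs", "should"),
                        parse_element(children[0])))
    return ("boolean", clauses)


# Supporting a new query type only requires registering a builder here.
BUILDERS = {
    "TermQuery": build_term,
    "BooleanQuery": build_boolean,
}


def parse_element(element):
    """Dispatch on the tag name to the matching builder."""
    builder = BUILDERS.get(element.tag)
    if builder is None:
        raise ValueError(f"unsupported query tag: {element.tag}")
    return builder(element)


def parse_query(xml_text):
    return parse_element(ET.fromstring(xml_text))


query_xml = (
    '<BooleanQuery>'
    '<Clause occurs="must"><TermQuery field="title">lucene</TermQuery></Clause>'
    '<Clause occurs="should"><TermQuery field="body">xquery</TermQuery></Clause>'
    '</BooleanQuery>'
)
print(parse_query(query_xml))
# ('boolean', [('must', ('term', 'title', 'lucene')), ('should', ('term', 'body', 'xquery'))])
```

Keeping the dispatch table separate from the builders is what makes further query types cheap to add, independent of which tag names are eventually chosen.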

Please see the readme.txt in the package for more details.

Cheers
Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

