[jira] [Updated] (DERBY-590) How to integrate Derby with Lucene API?

Rick Hillegas (JIRA) Mon, 17 Feb 2014 16:25:58 -0800

     [ 
https://issues.apache.org/jira/browse/DERBY-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rick Hillegas updated DERBY-590:
--------------------------------

    Attachment: derby-590-01-ag-publicAccessToLuceneRoutines.diff

Attaching derby-590-01-ag-publicAccessToLuceneRoutines.diff. This patch builds 
on Andrew's approach, wiring Lucene support into Derby's SQL authorization 
scheme. If this approach looks promising, I will write additional regression 
tests to verify the claims I make about this solution.

Before discussing what this patch does, let me recap what it does NOT do:

1) It does NOT encrypt Lucene indexes if the database is encrypted.

2) It does NOT work with in-memory databases.

3) It does NOT work with backup/restore.

In addition, I have changed the meaning of the updateIndex() procedure and lost 
the very useful meaning which it used to have. There are at least two important 
application profiles which have different needs from an index-updating 
procedure:

i) Update-intensive: These applications do not want to incur the performance 
hit of re-indexing a document every time it changes. These applications may 
work best with a bulk-reindexing cron job which executes when user activity is 
low. This is the use-case supported by the revised updateIndex() procedure: the 
Lucene index is dropped and completely recreated although schema objects 
connected to it are not bounced.

ii) Read-intensive: These applications can incur the performance hit of 
re-indexing a document every time it changes. These applications are well 
served by the trigger-driven usage of updateIndex() provided in Andrew's 
original patch.

I think that it may be possible to recover Andrew's original functionality. But 
this requires more familiarity with Lucene than I can claim. Maybe Andrew can 
propose some ideas and we can converge on a solution.

So much for the caveats. Here are the changes made by the new patch:

A) The layout of the Lucene support directory is changed. There is now a 
subdirectory for each schema which has indexed tables in it. Under that 
directory, there is a subdirectory for each table which has indexed columns. 
Under the table-specific directory, there is a subdirectory for each text 
column which is indexed. Those are the leaf directories which contain the 
actual Lucene files. So the directory structure looks like this:

{noformat}
databaseName
    seg0
    log
    lucene
        SCHEMA1
            TABLE1
                TEXT_COLUMN1
                TEXT_COLUMN2
...
{noformat}

B) The luceneQuery() table function has been changed to be context aware. See 
the work recently done on https://issues.apache.org/jira/browse/DERBY-6117. 
Users no longer invoke luceneQuery() directly. Instead, for each indexed 
column, we create a table function specific to that column. The table function 
has the name $schemaName.$tableName__$columnName. This is the crucial change 
which allows us to wire Lucene support up to Derby's SQL authorization. The 
schema owner can grant EXECUTE privilege on the column-specific table function. 
In addition, the underlying context-aware luceneQuery() machinery starts out by 
issuing a select against the text column and the key columns of the table. This 
ensures that users must enjoy SELECT privilege on all of those columns in order 
to run the column-specific table function.

C) A text column can be indexed only if the table has a primary key. The 
primary key can have multiple columns in it and the columns can be any 
indexable datatype. The column-specific table function returns a data set which 
includes the whole primary key plus the Lucene document id plus the rank number 
(the relevance of that document according to Lucene's calculations). This makes 
it easy to join the table function to the original table in order to obtain 
more extensive results.

D) The arguments to the Lucene support procedures are now case-insensitive SQL 
identifiers rather than case-sensitive strings. This is a departure from the 
convention followed by most Derby system procedures, but I think the departure 
is welcome and long overdue.

E) The Lucene support has been moved out of the tools jar and into the engine 
jar. This fixes the big surprise I recorded on 
https://issues.apache.org/jira/browse/DERBY-6470. This change also made it 
possible to use IdUtil in order to implement the semantics described above in 
D).

F) It is assumed that the Lucene jars will be checked into the Derby source 
tree and that the Lucene support classes will always be built. However, there 
is still no plan to include Lucene jars in Derby binary distributions. Users 
who want to enable the optional Lucene support will have to install the Lucene 
jars themselves.

G) In general, the following convention has been followed for each supported 
operation: Transactional writes are performed before any calls are made to 
Lucene. This means that if the Lucene calls raise an error, then the 
transactional writes are rolled back. I think that following a consistent 
convention like this will make it easier for users to reason about how Lucene 
support behaves.

H) Miscellaneous improvements have been made in the areas of code factoring and 
integration with the Java security manager.

Here's how you use the revised tool:

Let's say that you have a table with this shape...

{noformat}
create table lucenetest.titles
(
    ID int generated always as identity primary key,
    ISBN varchar(16),
    PRINTISBN varchar(16),
    title varchar(1024),
    subtitle varchar(1024),
    author varchar(1024),
    series varchar(1024),
    publisher varchar(1024),
    collections varchar(128),
    collections2 varchar(128)
);
{noformat}

The DBO loads the Lucene support tool as follows...

{noformat}
call syscs_util.syscs_register_tool( 'luceneSupport', true );
{noformat}

This creates the LuceneSupport schema, containing the following procedures and 
functions...

{noformat}
create procedure LuceneSupport.createIndex
(
    schemaname varchar( 128 ),
    tablename varchar( 128 ),
    textcolumn varchar( 128 )
)
parameter style java modifies sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.createIndex';

create procedure LuceneSupport.dropIndex
(
    schemaname varchar( 128 ),
        tablename varchar( 128 ),
        textcolumn varchar( 128 )
)
parameter style java modifies sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.dropIndex';

create procedure LuceneSupport.updateIndex
(
    schemaname varchar( 128 ),
        tablename varchar( 128 ),
        textcolumn varchar( 128 )
)
parameter style java reads sql data language java
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.updateIndex';

create function LuceneSupport.listIndexes() returns table
(
    id int,
        schemaname char( 128 ),
        tablename char( 128 ),
        columnname char( 128 ),
        lastupdated timestamp
)
language java parameter style DERBY_JDBC_RESULT_SET contains sql
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.listIndexes'
{noformat}

The LUCENETEST user can then index a text column as follows. The DBO can do 
this too. However, no-one else can index the column. That is because the 
createIndex() procedure attempts to create a table function in the lucenetest 
schema. Other users do not enjoy that privilege...

{noformat}
call LuceneSupport.createIndex( 'lucenetest', 'titles', 'title' );
{noformat}

...which creates the following directory structure...

{noformat}
databaseName
    seg0
    log
    lucene
        LUCENETEST
            TITLES
                TITLE
...
{noformat}

...and the following column-specific table function:

{noformat}
create function lucenetest.titles__title( query varchar( 32672 ), rankCutoff 
double )
returns table
(
    ID int,
    documentID int,
        rank double
)
language java parameter style derby_jdbc_result_set contains sql
external name 'org.apache.derby.impl.optional.lucene.LuceneSupport.luceneQuery';
{noformat}

The LUCENETEST user can then query the Lucene index and join it to the original 
table as follows...

{noformat}
select title, author, publisher, documentID, rank
from lucenetest.titles t, table ( lucenetest.titles__title( 'grapes', 0 ) ) l
where t.id = l.id;
{noformat}

The LUCENETEST user (or the DBO but no-one else) can drop the Lucene index as 
follows...

{noformat}
call LuceneSupport.dropIndex( 'lucenetest', 'titles','title' );
{noformat}

...which drops the lucene/LUCENETEST/TITLES/TITLE directory. Directory deletion 
cascades up. That is, if there are no more indexed columns in 
lucenetest.titles, then we also drop lucene/LUCENETEST/TITLES. And we drop 
lucene/LUCENETEST if there are no more indexed tables in the lucenetest schema.

The DBO (and only the DBO) can unload Lucene support as follows...

{noformat}
call syscs_util.syscs_register_tool( 'luceneSupport', false );
{noformat}

...which drops all procedures and table functions created by the tool. This 
command also drops the lucene subdirectory and everything under it.


Touches the following files:

--------------------

M       java/engine/org/apache/derby/impl/sql/compile/StaticMethodCallNode.java

Adds the Lucene support package to the list of Derby packages whose entry 
points can be bound to user-defined routines.

--------------------

A       java/engine/org/apache/derby/impl/optional
A       java/engine/org/apache/derby/impl/optional/lucene
A       java/engine/org/apache/derby/impl/optional/lucene/LuceneQueryVTI.java
A       java/engine/org/apache/derby/impl/optional/lucene/LuceneSupport.java
A       
java/engine/org/apache/derby/impl/optional/lucene/LuceneListIndexesVTI.java
A       java/engine/org/apache/derby/impl/optional/lucene/build.xml
M       java/engine/org/apache/derby/impl/build.xml
M       java/engine/org/apache/derby/catalog/Java5SystemProcedures.java

The Lucene support optional tool.

--------------------

M       java/engine/org/apache/derby/vti/VTITemplate.java

Some additional support for context-aware table functions. This support allows 
the function to query the JDBC metadata for the shape of the ResultSet it 
returns.

--------------------

M       java/engine/org/apache/derby/loc/messages.xml
M       java/shared/org/apache/derby/shared/common/reference/SQLState.java

New error messages for Lucene support.


--------------------

M       build.xml
A       tools/java/lucene-analyzers-common-4.5.0.jar
A       tools/java/lucene-queryparser-4.5.0.jar
A       tools/java/lucene-core-4.5.0.jar
M       tools/ant/properties/extrapath.properties
M       tools/jar/extraDBMSclasses.properties

Miscellaneous build machinery.

--------------------

A       tools/release/notices/lucene.txt

The Lucene notice file which must be included in the Derby NOTICE file if we 
are to ship the lucene jars in Derby source distributions.

--------------------

M       
java/testing/org/apache/derbyTesting/functionTests/tests/lang/_Suite.java
A       
java/testing/org/apache/derbyTesting/functionTests/tests/lang/LuceneSupportTest.java

Initial tests.


> How to integrate Derby with Lucene API?
> ---------------------------------------
>
>                 Key: DERBY-590
>                 URL: https://issues.apache.org/jira/browse/DERBY-590
>             Project: Derby
>          Issue Type: Improvement
>          Components: Documentation, SQL
>            Reporter: Abhijeet Mahesh
>              Labels: derby_triage10_11
>         Attachments: derby-590-01-ag-publicAccessToLuceneRoutines.diff, 
> lucene_demo.diff, lucene_demo_2.diff
>
>
> In order to use derby with lucene API what should be the steps to be taken? 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (DERBY-590) How to integrate Derby with Lucene API?

Reply via email to