Re: Lucene integration

Geoff Hendrey Wed, 18 Mar 2009 06:49:45 -0700

I've been following Knut's pointers and reading the docs on the classes that 
marshal themselves over the wire via their writeObject method.

So, question about this:
"Type=update, Table=employee, Page=4321, Index=4, field 3=50000"

Do the page and index, collectively, constitute a "row ID"?
If it is always constant, then these three fields are sufficient to 
permanently identify the row, and we can use that information to constitute a 
document ID in Lucene.
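
A minimal Java sketch of that idea, assuming the table, page, and index really do stay constant for the life of a row (which is exactly the open question here). The RowKey class and the "table:page:index" string layout are hypothetical names made up for illustration; neither Derby nor Lucene defines them.

```java
// Hypothetical composite key. Assumes (table, page, index) never changes
// for a given row, which has NOT been confirmed for Derby.
final class RowKey {
    final String table;
    final long page;
    final int index;

    RowKey(String table, long page, int index) {
        this.table = table;
        this.page = page;
        this.index = index;
    }

    // Render the key as one string that could be stored as a document ID.
    // Assumes the table name itself contains no ':' character.
    String toDocId() {
        return table + ":" + page + ":" + index;
    }

    // Parse a string produced by toDocId() back into its three parts.
    static RowKey fromDocId(String docId) {
        String[] parts = docId.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected table:page:index, got " + docId);
        }
        return new RowKey(parts[0], Long.parseLong(parts[1]), Integer.parseInt(parts[2]));
    }
}
```

A search hit carrying such an ID could be turned back into its parts with RowKey.fromDocId(...) when a lookup in the database is needed.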

Lucene is incredibly flexible with regard to strategies for indexing content. 
For example, the obvious strategy I see is to make every log entry a *logical* 
document in the Lucene index. My ideal way of using Lucene with Derby would 
actually provide a historical search so that you could view old versions of 
rows, if you so desired (kind of like being able to browse/search the history 
of the row).
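
As a rough illustration of the "one logical document per log entry" strategy, here is a small in-memory Java stand-in. It does not use the Lucene API at all; RowHistory and its methods are hypothetical names, and the only point is that each change is appended rather than overwriting the previous version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index that keeps one entry per change.
final class RowHistory {
    // Row ID -> every version seen for that row, oldest first.
    private final Map<String, List<String>> versions = new HashMap<>();

    // Append a new version instead of replacing the previous one.
    void record(String rowId, String change) {
        versions.computeIfAbsent(rowId, k -> new ArrayList<>()).add(change);
    }

    // All recorded versions of a row, oldest first; empty if the row is unknown.
    List<String> history(String rowId) {
        List<String> found = versions.getOrDefault(rowId, Collections.<String>emptyList());
        return Collections.unmodifiableList(found);
    }
}
```

In a real index, each appended entry would correspond to a separate document sharing the same row ID field, so a search on that ID would return the whole history.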

So, any other suggestions on how to get the data into Lucene? I still think 
that what I, as a consumer of this feature, would like is a way to search 
Lucene and have the search results include a "ROW ID" (whatever that is) that 
I can subsequently use to correlate the Lucene-indexed data back to the 
database (I'm not concerned about loss of transactional integrity). Also, keep 
in mind that most of the time, Lucene will simply provide the answer to my 
query, and I won't actually *need* to go back to Derby at all.

 -geoff
“XML? Too much like HTML. It'll never work on the Web!” 
-anonymous 

________________________________
From: Jørgen Løland <[email protected]>
To: Derby Discussion <[email protected]>
Sent: Wednesday, March 18, 2009 12:46:07 AM
Subject: Re: Lucene integration

Geoffrey Hendrey wrote:
> Yeah, that's exactly what I had in mind. The lucene server would receive the 
> binary data, unpack it, and use the Lucene API to create and modify the 
> Lucene index. 
> Does derby have utilities for unmarshalling the ".dat" format? There must be. 
> Otherwise there must at least be a clear spec for the binary format.

I'm afraid it's not as easy as unmarshalling the log. The main problem is that 
Derby uses a physical log format; when you do an

"UPDATE employee SET salary=50000 WHERE empid=123"

... the corresponding log record looks something like this:

"Type=update, Table=employee, Page=4321, Index=4, field 3=50000"

That is not the exact format, but you get the idea.

Without knowing which record is in index 4 on page 4321, there's no way to know 
which record is updated. Hence, there's nothing to feed into Lucene.
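
To make the point concrete, here is a small Java sketch that parses the illustrative record text quoted above. It is not a parser for Derby's real log, which is binary and in a Derby-internal format; the class name IllustrativeLogRecord is made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parses only the illustrative "Key=value, Key=value" text from this thread,
// NOT Derby's actual on-disk log format.
final class IllustrativeLogRecord {

    // Split the comma-separated text into an ordered map of field name -> value.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : text.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("not a key=value pair: " + pair);
            }
            fields.put(kv[0].trim(), kv[1].trim());
        }
        return fields;
    }
}
```

Parsing the example record yields Type, Table, Page, Index, and "field 3", but nothing like empid=123. The physical position is all the record carries, so the row's logical identity has to come from the page contents themselves.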

I'm not saying there's no way to hook Lucene and Derby replication together, 
but I doubt that it can be done without having the entire database in the 
receiving end.

> I'd like to clarify that lucene doesn't expect anything as "input". I think 
> this is a common misconception about lucene. In fact, although lucene does 
> have some sample index builders for document formats like HTML, the reality 
> is that almost all applications that use lucene simply parse the data that 
> needs to be indexed and use the lucene IndexWriter to manually stuff the 
> desired information into the index. There is no need to retain the original 
> data source. In my current project I build a 12gb lucene index from parsing 
> 1,950,000 source data records.
> 
> On Mar 16, 2009, at 7:00 AM, Jørgen Løland <[email protected]> wrote:
> 
> Geoffrey Hendrey wrote:
> Would it be possible for the derby team to implement lucene support in the 
> following way? Hook into the asynchronous replication protocol to send 
> committed transactions to a lucene receiver. I think it is acceptable for the 
> free text search to only "see" committed data. Alterative to opening the 
> protocol would be to create an abstract ReceiverServer for asynchronous data, 
> then LuceneReceiver is just a subclass. Thoughts? 
> What does Lucene expect as input? I doubt that the replication code can be 
> easily integrated with Lucene because...
> 
> 1) The information replication sends from a master to a slave is a physical 
> transaction log, which is in a derby-internal format. It is not human 
> readable. To get an idea of what it looks like, you can take a look at 
> logN.dat in one of your databases' log/ directories.
> 2) Replication does not distinguish between committed and uncommitted data; 
> log for all transactions, committed or not, is sent to the slave.
> 
> This means that before anything is fed into Lucene, the information has to be 
> processed. This processing is effectively Derby's crash recovery code and is 
> non-trivial to extract.
> 
> Note that I'm not familiar with Lucene.
> 
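
Point 2 in the quoted message can be illustrated with a small Java sketch: changes are buffered per transaction and released only when that transaction commits. This covers only the committed/uncommitted filtering, not the rest of what crash recovery does, and the CommittedOnlyFilter class, its callbacks, and the string-valued changes are all hypothetical rather than part of Derby's replication code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical filter that lets only committed changes through to an indexer.
final class CommittedOnlyFilter {
    // Transaction ID -> changes seen so far for a transaction still in flight.
    private final Map<Long, List<String>> pending = new HashMap<>();
    // Changes of committed transactions, in commit order.
    private final List<String> committed = new ArrayList<>();

    // Buffer a change; it is not visible until its transaction commits.
    void onChange(long txId, String change) {
        pending.computeIfAbsent(txId, k -> new ArrayList<>()).add(change);
    }

    // Release all buffered changes of the committing transaction.
    void onCommit(long txId) {
        List<String> changes = pending.remove(txId);
        if (changes != null) {
            committed.addAll(changes);
        }
    }

    // Discard the buffered changes of an aborted transaction.
    void onAbort(long txId) {
        pending.remove(txId);
    }

    // Snapshot of everything that is safe to hand to the indexer.
    List<String> committedChanges() {
        return new ArrayList<>(committed);
    }
}
```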

-- Jørgen Løland
