Yeah, that's exactly what I had in mind. The Lucene server would receive the 
binary data, unpack it, and use the Lucene API to create and modify the Lucene 
index. 

Does Derby have utilities for unmarshalling the ".dat" format? There must be 
some; failing that, there must at least be a clear spec for the binary format.

I'd like to clarify that Lucene doesn't expect anything in particular as 
"input". I think this is a common misconception about Lucene. Although Lucene 
does ship some sample index builders for document formats like HTML, almost 
all applications that use Lucene simply parse the data that needs to be 
indexed and use the Lucene IndexWriter to add the desired information to the 
index themselves. There is no need to retain the original data source. In my 
current project I build a 12 GB Lucene index by parsing 1,950,000 source data 
records.

On Mar 16, 2009, at 7:00 AM, Jørgen Løland wrote:

Geoffrey Hendrey wrote:
Would it be possible for the Derby team to implement Lucene support in the 
following way? Hook into the asynchronous replication protocol to send 
committed transactions to a Lucene receiver. I think it is acceptable for the 
free text search to only "see" committed data. An alternative to opening the 
protocol would be to create an abstract ReceiverServer for asynchronous data, 
with LuceneReceiver as just one subclass. Thoughts? 
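
As a rough illustration of the abstract-receiver idea above, here is a minimal 
Java sketch. ReceiverServer and LuceneReceiver are the names proposed in this 
message; everything else (the receive and handle methods, the UTF-8 payload, 
and the in-memory list standing in for a Lucene index) is a hypothetical 
assumption, not an existing Derby or Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class. Derby does not expose such a hook; this only
// illustrates the shape of the proposal.
abstract class ReceiverServer {
    // Entry point that an (assumed) transport would call for each payload.
    final void receive(byte[] payload) {
        handle(payload.clone()); // defensive copy so the subclass owns the bytes
    }

    // Subclasses decide what to do with the received bytes.
    protected abstract void handle(byte[] payload);
}

// Stand-in for a Lucene-backed receiver. A real one would decode the data
// and write to a Lucene index; this one decodes UTF-8 text into a list.
class LuceneReceiver extends ReceiverServer {
    private final List<String> indexed = new ArrayList<>();

    @Override
    protected void handle(byte[] payload) {
        indexed.add(new String(payload, StandardCharsets.UTF_8));
    }

    // Snapshot of everything "indexed" so far, in arrival order.
    List<String> indexed() {
        return new ArrayList<>(indexed);
    }
}
```

The point of the split is that the transport only knows about ReceiverServer, 
so a receiver for some other consumer could be added without touching the 
sending side.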

What does Lucene expect as input? I doubt that the replication code can be 
easily integrated with Lucene because...

1) The information replication sends from a master to a slave is a physical 
transaction log, which is in a Derby-internal format. It is not human readable. 
To get an idea of what it looks like, you can take a look at logN.dat in one of 
your databases' log/ directories.
2) Replication does not distinguish between committed and uncommitted data; the 
log for all transactions, committed or not, is sent to the slave.

This means that before anything is fed into Lucene, the information has to be 
processed. This processing is effectively Derby's crash recovery code and is 
non-trivial to extract.
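
To make the committed-versus-uncommitted point concrete, the Java sketch below 
buffers changes per transaction and releases them only when that transaction 
commits. The record model (a transaction id, a kind, and a string describing 
the change) is an invented simplification for illustration. Derby's physical 
log looks nothing like this, and the hard part described above, interpreting 
that log with recovery logic, is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical filter: passes on only the changes of committed transactions.
class CommittedOnlyFilter {
    enum Kind { UPDATE, COMMIT, ABORT }

    // Changes seen so far for transactions that have not finished yet.
    private final Map<Long, List<String>> pending = new HashMap<>();
    // Changes from committed transactions, in commit order.
    private final List<String> committed = new ArrayList<>();

    // Feed one simplified log record; 'change' is ignored for COMMIT/ABORT.
    void accept(long txId, Kind kind, String change) {
        switch (kind) {
            case UPDATE:
                pending.computeIfAbsent(txId, id -> new ArrayList<>()).add(change);
                break;
            case COMMIT: {
                List<String> changes = pending.remove(txId);
                if (changes != null) {
                    committed.addAll(changes);
                }
                break;
            }
            case ABORT:
                pending.remove(txId); // discard everything the transaction did
                break;
        }
    }

    // Snapshot of the changes that are safe to index.
    List<String> committed() {
        return new ArrayList<>(committed);
    }
}
```

Anything still sitting in the pending map belongs to a transaction whose 
outcome is not yet known, so it would have to stay out of the search index 
until a commit record arrives.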

Note that I'm not familiar with Lucene.

-- 
Jørgen Løland
