Re: Lucene external field storage contribution.

2004-11-09 Thread Terry Steichen
Kevin,

Sorry for the delay in replying.  I think your idea for an external field 
storage mechanism is excellent.  I'd love to see it and, if I can, I'll 
help make that happen.

Regards,

Terry
  - Original Message - 
  From: Kevin A. Burton 
  To: Lucene Users List 
  Sent: Sunday, November 07, 2004 4:47 PM
  Subject: Lucene external field storage contribution.


  About 3 months ago I developed an external storage engine which ties into 
  Lucene. 

  I'd like to discuss making a contribution so that this is integrated 
  into a future version of Lucene.

  I'm going to paste my original PROPOSAL in this email. 

  There wasn't a ton of feedback first time around but I figure squeaky 
  wheel gets the grease...


  
  

Re: Lucene external field storage contribution

2004-11-08 Thread Miles Barr
 On Sun, 07 Nov 2004 13:51:23, Kevin A. Burton wrote:
 About 3 months ago I developed an external storage engine which ties
 into Lucene. 
  
 I'd like to discuss making a contribution so that this is integrated
 into a future version of Lucene.
  
 I'm going to paste my original PROPOSAL in this email. 
  
 There wasn't a ton of feedback first time around but I figure squeaky
 wheel gets the grease...

I'd be interested in this type of functionality. Could you raise an
issue in Bugzilla so it's easier to track?


Cheers,
-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene external field storage contribution.

2004-11-07 Thread Kevin A. Burton
About 3 months ago I developed an external storage engine which ties into 
Lucene. 

I'd like to discuss making a contribution so that this is integrated 
into a future version of Lucene.

I'm going to paste my original PROPOSAL in this email. 

There wasn't a ton of feedback the first time around, but I figure the 
squeaky wheel gets the grease...



I created this proposal because we need this fixed at work. I want to 
go ahead and work on a vertical fix for our version of Lucene and then 
submit it back to Jakarta.
There seems to be a lot of interest here and I wanted to get feedback 
from the list before moving forward ...

Should I put this in the wiki?!
Kevin
** OVERVIEW **
Currently Lucene supports 'stored fields', where the content of these 
fields is kept within the Lucene index for later use.

While acceptable for small indexes, larger amounts of stored fields 
prevent:

- Fast index merges, since the full content has to be continually merged.

- Storing the indexes in memory (since a LOT of memory would be 
  required, and this is cost prohibitive).

- Fast queries, since block caching can't be used on the index data.

For example, in our current setup our index size is 20G.  Nearly 90% of 
this is content.  If we could store the content outside of Lucene, our 
merges and searches would be MUCH faster.  If we could store the index 
in MEMORY this could be orders of magnitude faster.

** PROPOSAL **
Provide an external field storage mechanism which supports legacy indexes 
without modification.  Content is stored in a content segment.  The only 
change would be a field with 3 (or 4, if checksums are enabled) values:
- CS_SEGMENT

  Logical ID of the content segment.  This is an integer value.  There is 
  a global Lucene property named CS_ROOT under which all the content is 
  stored.  The segments are just flat files with pointers.  Segments are 
  broken into logical pieces by time and size.  Usually 100M of content 
  would be in one segment.

- CS_OFFSET

  The byte offset of the field.

- CS_LENGTH

  The length of the field.

- CS_CHECKSUM

  Optional checksum to verify that the content is correct when fetched 
  from the index.

- The field value here would be exactly 'N:O:L', where N is the segment 
  number, O is the offset, and L is the length.  O and L are 64-bit 
  values.  N is a 32-bit value (though 64-bit wouldn't really hurt).

This mechanism allows for the external storage of any named field.
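As a rough sketch (the class and method names here are illustrative, not part of Lucene), the 'N:O:L' value could be packed and parsed like so:

```java
// Hypothetical helper for the proposed 'N:O:L' field value: N is a
// 32-bit segment id, O and L are 64-bit offset/length values.
public class FieldPointer {

    public final int segment;  // CS_SEGMENT
    public final long offset;  // CS_OFFSET
    public final long length;  // CS_LENGTH

    public FieldPointer(int segment, long offset, long length) {
        this.segment = segment;
        this.offset = offset;
        this.length = length;
    }

    /** Serialize to the proposed 'N:O:L' form. */
    public String encode() {
        return segment + ":" + offset + ":" + length;
    }

    /** Parse an 'N:O:L' field value back into its parts. */
    public static FieldPointer parse(String value) {
        String[] parts = value.split(":");
        return new FieldPointer(Integer.parseInt(parts[0]),
                                Long.parseLong(parts[1]),
                                Long.parseLong(parts[2]));
    }
}
```

The encoded string is what the proposal would store as the ordinary Lucene field value.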
 
CS_OFFSET and CS_LENGTH allow use of RandomAccessFile and the new NIO 
code for efficient content lookup (though filehandle caching should 
probably be used).
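A minimal lookup sketch under two assumptions of mine (segment files live directly under CS_ROOT and are named 'segment-<id>.dat'; the naming scheme is not part of the proposal): seek to CS_OFFSET, read CS_LENGTH bytes, with a CRC32 check standing in for the optional CS_CHECKSUM:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hypothetical content lookup against a flat segment file.
public class ContentLookup {

    public static byte[] read(File csRoot, int segment, long offset, long length)
            throws IOException {
        // The file naming scheme below is an assumption for illustration.
        File segmentFile = new File(csRoot, "segment-" + segment + ".dat");
        byte[] buf = new byte[(int) length];  // a real impl would guard this cast
        RandomAccessFile raf = new RandomAccessFile(segmentFile, "r");
        try {
            raf.seek(offset);    // jump straight to CS_OFFSET
            raf.readFully(buf);  // read exactly CS_LENGTH bytes
        } finally {
            raf.close();         // a real impl would cache file handles
        }
        return buf;
    }

    /** Stand-in for the optional CS_CHECKSUM verification, here via CRC32. */
    public static boolean verify(byte[] content, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return crc.getValue() == expectedChecksum;
    }
}
```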

Since content is broken into logical 100M segments, the underlying 
filesystem can organize the file into contiguous blocks for efficient, 
non-fragmented lookup.

File manipulation is easy, and indexes can be merged by simply 
concatenating the second file to the end of the first (though the 
segment, offset, and length need to be updated).  (FIXME: I think I need 
to think about this more, since I will have < 100M per sync.)
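Under the same illustrative file layout, the concatenation merge could be sketched as follows; the returned delta is what must be added to every CS_OFFSET that pointed into the appended file (and those entries' CS_SEGMENT becomes the surviving segment's id):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Merge two content segments by appending src onto dst.
public class SegmentMerge {

    /** Appends src to dst; returns the offset delta for src's pointers. */
    public static long append(File dst, File src) throws IOException {
        long delta = dst.length();  // every CS_OFFSET from src shifts by this
        FileInputStream in = new FileInputStream(src);
        FileOutputStream out = new FileOutputStream(dst, true);  // append mode
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
            out.close();
        }
        return delta;
    }
}
```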

Supporting full Unicode is important.  Full java.lang.String storage is 
used with String.getBytes(), so we should be able to avoid Unicode 
issues.  If Java has a correct java.lang.String representation, it's 
possible to easily add Unicode support just by serializing the byte 
representation.  (Note that the JDK says that the DEFAULT system char 
encoding is used, so if this is ever changed it might break the index.)
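To sidestep that default-encoding caveat, the byte conversion can pin an explicit charset rather than relying on the platform default; UTF-8 is shown here as one safe choice (this hardening is a suggestion, not part of the proposal as written):

```java
import java.io.UnsupportedEncodingException;

// Round-trip strings through an explicit, platform-independent encoding
// so the stored content does not depend on the JVM's default charset.
public class UnicodeSafe {

    public static byte[] toBytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("UTF-8");     // explicit charset, not the default
    }

    public static String fromBytes(byte[] b) throws UnsupportedEncodingException {
        return new String(b, "UTF-8");  // symmetric decode
    }
}
```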

While Linux and modern versions of Windows (not sure about OSX) support 
64-bit filesystems, the 4G storage boundary of 32-bit filesystems (ext2 
is an example) is an issue.  Using smaller indexes can prevent this, but 
eventually segment lookup in the filesystem will be slow.  This will 
only happen within terabyte storage systems, so hopefully the developer 
has migrated to another (modern) filesystem such as XFS.

** FEATURES **
  - Must be able to replicate indexes easily to other hosts.

  - Adding content to the index must be CHEAP.

  - Deletes need to be cheap (these are cheap for older content; just 
    discard older indexes).

  - Filesystem needs to be able to optimize storage.

  - Must support UNICODE and binary content (images, blobs, byte arrays, 
    serialized objects, etc).

  - Filesystem metadata operations should be fast.  Since content is 
    kept in LARGE indexes this is easy to avoid.

  - Migration to the new system from legacy indexes should be fast and 
    painless for future developers.
 
 


--
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 

Re: Lucene external field storage contribution.

2004-11-07 Thread Giulio Cesare Solaroli
Hi Kevin


On Sun, 07 Nov 2004 13:47:10 -0800, Kevin A. Burton
[EMAIL PROTECTED] wrote:
 About 3 months ago I developed an external storage engine which ties into
 Lucene.
 
 I'd like to discuss making a contribution so that this is integrated
 into a future version of Lucene.
 
 I'm going to paste my original PROPOSAL in this email.
 
 There wasn't a ton of feedback first time around but I figure squeaky
 wheel gets the grease...

You should probably post this kind of message to the developer
list instead, to get more feedback.

Anyway your suggestion makes a lot of sense.

I would like to suggest a further extension: storing TermVectors in
their own separate segment(s).

The rationale for this feature is a carbon copy of your original post. :-]

Hope to see your patch in one of the next releases.

Giulio Cesare Solaroli

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]