Digy,

That's probably a good idea. I need to clean up the code as it stands now and make sure the unit tests pass. I'm going to shoot for getting a patch out in the next couple of hours.

Thanks,
Christopher

On Tue, May 24, 2011 at 2:47 PM, Digy <digyd...@gmail.com> wrote:

> Hi Christopher,
>
> In my experience, this kind of email either gets no response or too many
> fluctuations with no result.
> What about preparing a *small* "Proof of Concept" code that passes all unit
> tests?
>
> DIGY
>
> -----Original Message-----
> From: Christopher Currens [mailto:currens.ch...@gmail.com]
> Sent: Wednesday, May 25, 2011 12:08 AM
> To: lucene-net-dev@lucene.apache.org
> Subject: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net
>
> All,
>
> I've spent the past few days looking at what it would take to implement
> proper streaming of data into and out of an index. Fortunately, it hasn't
> proven very difficult at all, leaving me with a solution that works very
> nicely. Now that I know it's possible, I wanted to discuss with the
> community the best way to add this to the API.
>
> Currently, it's set up so that a field can have a Stream value if it's
> binary (System.IO.Stream StreamValue()). I plan to replace byte[] with
> streaming functions internally, wherever Lucene uses it. I think it's a
> good idea to keep the byte[] BinaryValue() as it is, but essentially
> replace it, by default, with a kind of lazy loading. In the current version
> of Lucene, if a user were to open a document with a binary field, that
> entire field will be loaded into memory.
>
> The idea behind replacing the internals of FieldsReader.cs by passing a
> stream along instead of a byte[] is that people using the API to stream the
> data out will load no more into memory than they have to. People using the
> byte[] BinaryValue() function to get the binary data will actually have
> improved performance as well, as the byte array will be loaded when calling
> the method, instead of at the creation of the document.
>
> As a final note on binary data streaming, by streaming the data in, we
> obviously can't support compression of those fields. The compression in
> Lucene is poor anyway, as it's not compression that can be done in blocks.
> It requires large amounts of memory, because it needs all the data in
> memory to do the compression, which is also done in a separate byte array.
> However, an ability I had briefly talked to Troy about in person was the
> ability to add StreamFilters, so that data passed is filtered first by a
> compression algorithm or such before it's stored in the index. That doesn't
> really apply directly to the Lucene domain, but it does at least afford the
> user the opportunity to do that via streaming data into Lucene.Net.
>
> I also want to add proper TextReader support to Lucene.Net. A large
> difference between the Java and .NET versions of Lucene is that the Java
> version supports setting a field's value to a TextReader, which both
> analyzes and stores the data. Because the TextReader in .NET doesn't
> support resetting or seeking of the underlying stream, we can only analyze
> the text in Lucene; we can't store the field.
>
> A solution that comes to mind would be creating a util class, something
> like SeekableTextReader, that inherits from TextReader and can be passed to
> the field, with special behavior that allows it to be reset, and thus both
> analyzed and stored. Perhaps the largest downside to that solution is that,
> in order to keep the API the same while allowing it to be stored, it would
> require fairly ugly checks like "if(reader is SeekableTextReader) //do
> this".
>
> Perhaps a cleaner solution would be to add yet another value to the Field
> class that allowed for a SeekableTextReader to be passed. This way has its
> own downsides, in that now there are two methods that expect TextReaders,
> one that stores and one that doesn't, which seems rather confusing. But I
> suppose this is why I was looking for the community's opinion in the first
> place.
>
> The more comments about this the better. I think adding this could add some
> much needed functionality to Lucene, and start setting apart its
> performance from the Java version.
>
> Thanks,
> Christopher
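
For discussion, here is a minimal sketch of what the SeekableTextReader idea from the quoted proposal might look like. Only the class name comes from the proposal; the constructor, the Reset() method, and the Demo helper are assumptions made for illustration and are not part of Lucene.Net's API. It wraps a seekable Stream so that the same text can be read twice, once for analysis and once for storing.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: the name is taken from the proposal, the members are assumed.
public sealed class SeekableTextReader : TextReader
{
    private readonly Stream _stream;
    private readonly Encoding _encoding;
    private readonly long _start;   // position to rewind to
    private StreamReader _inner;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        if (stream == null) throw new ArgumentNullException("stream");
        if (encoding == null) throw new ArgumentNullException("encoding");
        if (!stream.CanSeek)
            throw new ArgumentException("Stream must support seeking.", "stream");

        _stream = stream;
        _encoding = encoding;
        _start = stream.Position;
        _inner = new StreamReader(_stream, _encoding);
    }

    public override int Peek() { return _inner.Peek(); }

    public override int Read() { return _inner.Read(); }

    public override int Read(char[] buffer, int index, int count)
    {
        return _inner.Read(buffer, index, count);
    }

    // Rewind the underlying stream and start decoding from scratch.
    // The old StreamReader is dropped without being disposed, because
    // disposing it would also close the shared stream.
    public void Reset()
    {
        _stream.Seek(_start, SeekOrigin.Begin);
        _inner = new StreamReader(_stream, _encoding);
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();   // closes the underlying stream too
        base.Dispose(disposing);
    }
}

public static class Demo
{
    // Reads the same text twice: once as if analyzing, once as if storing.
    public static bool ReadsSameTextTwice(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        using (var reader = new SeekableTextReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            string analyzed = reader.ReadToEnd();
            reader.Reset();
            string stored = reader.ReadToEnd();
            return analyzed == text && stored == text;
        }
    }
}
```

With a type like this, the "if(reader is SeekableTextReader)" check mentioned in the proposal would decide whether a field can be stored as well as analyzed; a plain TextReader would still only be analyzed.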