Hi Christopher,

In my experience, this kind of email either gets no response or too much back-and-forth with no result. What about preparing a *small* "Proof of Concept" that passes all unit tests?

DIGY

-----Original Message-----
From: Christopher Currens [mailto:currens.ch...@gmail.com]
Sent: Wednesday, May 25, 2011 12:08 AM
To: lucene-net-dev@lucene.apache.org
Subject: [Lucene.Net] Adding proper System.IO.Stream support to Lucene.Net

All,

I've spent the past few days looking at what it would take to implement proper streaming of data into and out of an index. Fortunately, it hasn't proven very difficult at all, leaving me with a solution that works very nicely. Now that I know it's possible, I wanted to discuss with the community the best way to add this to the API.

Currently, it's set up so that a field can have a Stream value if it's binary (System.IO.Stream StreamValue()). I plan to replace byte[] with streaming functions internally, wherever Lucene uses it. I think it's a good idea to keep the byte[] BinaryValue() as it is, but essentially replace it, by default, with a kind of lazy loading.

In the current version of Lucene, if a user opens a document with a binary field, that entire field is loaded into memory. The idea behind passing a stream along instead of a byte[] inside FieldsReader.cs is that people using the API to stream the data out will load no more into memory than they have to. People using the byte[] BinaryValue() function to get the binary data will actually see improved performance as well, because the byte array will be loaded when the method is called, instead of when the document is created.

As a final note on binary data streaming: by streaming the data in, we obviously can't support compression of those fields. The compression in Lucene is poor anyway. It can't be done in blocks, so it needs all of the data in memory to do the compression, and the compression itself is done in a separate byte array.
However, one ability I briefly discussed with Troy in person was adding StreamFilters, so that data passed in is first filtered by a compression algorithm or similar before it's stored in the index. That doesn't really apply directly to the Lucene domain, but it does at least give the user the opportunity to do so by streaming data into Lucene.Net.

I also want to add proper TextReader support to Lucene.Net. A large difference between the Java and .NET versions of Lucene is that the Java version supports setting a field's value to a TextReader, which is both analyzed and stored. Because the TextReader in .NET doesn't support resetting or seeking of the underlying stream, we can only analyze the text in Lucene; we can't store the field.

A solution that comes to mind would be creating a util class, something like SeekableTextReader, that inherits from TextReader and can be passed to the field, with special behavior that allows it to be reset, and thus both analyzed and stored. Perhaps the largest downside to that solution is that, in order to keep the API the same while allowing the field to be stored, it would require fairly ugly checks like "if(reader is SeekableTextReader) //do this".

Perhaps a cleaner solution would be to add yet another value to the Field class that allows a SeekableTextReader to be passed. This way has its own downsides: there would now be two methods that expect TextReaders, one that stores and one that doesn't, which seems rather confusing. But I suppose this is why I was looking for the community's opinion in the first place.

The more comments about this the better. I think adding this could bring some much-needed functionality to Lucene, and start setting its performance apart from the Java version.

Thanks,
Christopher
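
For illustration, the SeekableTextReader idea described in the message could be sketched roughly as follows. This is only a sketch under stated assumptions: the class name comes from the message, but the constructor, the Reset() method, and the requirement that the underlying Stream be seekable are all hypothetical, and none of this is existing Lucene.Net API.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Lucene.Net): a TextReader that can be
// rewound, so the same text could be read once for analysis and once for storage.
public sealed class SeekableTextReader : TextReader
{
    private readonly Stream _stream;
    private readonly Encoding _encoding;
    private StreamReader _inner;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        if (stream == null) throw new ArgumentNullException("stream");
        // Assumption: resetting is only possible when the stream can seek.
        if (!stream.CanSeek) throw new ArgumentException("Stream must support seeking.", "stream");

        _stream = stream;
        _encoding = encoding ?? Encoding.UTF8;
        _inner = new StreamReader(_stream, _encoding);
    }

    public override int Peek()
    {
        return _inner.Peek();
    }

    public override int Read()
    {
        return _inner.Read();
    }

    public override int Read(char[] buffer, int index, int count)
    {
        return _inner.Read(buffer, index, count);
    }

    // Rewinds to the start of the stream. A fresh StreamReader is created so
    // that no stale buffered characters or decoder state survive the rewind.
    public void Reset()
    {
        _stream.Seek(0, SeekOrigin.Begin);
        _inner = new StreamReader(_stream, _encoding);
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _inner.Dispose(); // also closes the underlying stream
        }
        base.Dispose(disposing);
    }
}
```

With something like this, a caller could read the text to the end, call Reset(), and read the same text again. Whether the Field class should detect the type with an "is" check or take it through a separate overload is the open design question raised in the message.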