Grant Ingersoll wrote:
>
> Some randomly pieced together thoughts (I may not even be fully awake
> yet :-) so feel free to tell me I'm not understanding this correctly)
>
> My first thought was how is this different from just having a binary
> field, but if I understand correctly it is to be stored in a separate file?
>
> Now you are proposing a faster storage mechanism for them, essentially,
> since they are to be stored separately from the Documents themselves?
> But the other key is they are all stored next to each other, right, so
> the scan is a lot faster?
>

Yes, scanning and skipping would be much faster, comparable to a posting
list. In fact, what I'm proposing is a new kind of posting list. Since you
mentioned the magic term "flexible indexing" already ;), let's take a look
at http://wiki.apache.org/lucene-java/FlexibleIndexing. Here four kinds of
posting lists are proposed:

a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+

Today, we have c. and d. already. c. is the original Lucene format, and d.
can be achieved by storing the boost as a payload. The new format I'm
proposing actually covers a. and b. If you don't store a payload, it's
basically a binary posting list without freq and positions (a.). If you
store the boost as a payload, then you have b.

> I think one of the questions that will come up from users is when should
> I use addMetadata and when should I use addField? Why make the
> distinction to the user? Fields have always represented metadata, all

I'd like to make a distinction because IMO these are two different use
cases. Not necessarily in terms of functionality, but in terms of
performance. You are right, you can store everything today as stored
fields, but if you want to use e.g. a stored value for scoring, then
performance is terrible. This is simply the nature of the store: it is
optimized for returning all stored fields for a document. Even a
FieldSelector doesn't help you much, unless the docs contain very big
fields that you don't want to return. The reason is that two random I/Os
are necessary to find the stored fields of a document. After that, only
sequential I/O has to be performed. And the overhead of loading e.g. 10KB
instead of 2KB is not big, much less than two random I/Os, I believe.

Payloads are also much better in terms of cache utilization. Since they
are stored next to each other and are accessed frequently (in every
search), it's very likely that big portions of that posting list will be
in the cache.

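To make format b. (<doc, boost>+) a bit more concrete, here is a rough
sketch in plain Java. It is only an illustration under my own assumptions,
not Lucene's actual file format or API: the class DocBoostPostings and its
methods are made up, doc IDs are delta-encoded as variable-length ints, and
the boost is a single byte stored right after each doc ID. The point is
that all entries sit next to each other, so finding a boost is one
sequential scan.

```java
// Hypothetical sketch of a <doc, boost>+ posting list; not Lucene code.
final class DocBoostPostings {
    // Layout per entry: VInt(docId - previousDocId), then one boost byte.
    private final byte[] data;
    private final int count;

    private DocBoostPostings(byte[] data, int count) {
        this.data = data;
        this.count = count;
    }

    // Assumption: docIds are non-negative and strictly increasing,
    // and boosts[i] belongs to docIds[i].
    static DocBoostPostings encode(int[] docIds, byte[] boosts) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - prev;
            if (delta < 0 || (i > 0 && delta == 0)) {
                throw new IllegalArgumentException("docIds must be strictly increasing");
            }
            // VInt: 7 payload bits per byte, high bit set means "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            out.write(boosts[i]);
            prev = docIds[i];
        }
        return new DocBoostPostings(out.toByteArray(), docIds.length);
    }

    // Scans the list front to back. Returns the boost as 0..255,
    // or -1 if the document is not in the list.
    int boostOf(int target) {
        int pos = 0;
        int doc = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += delta;
            int boost = data[pos++] & 0xFF;
            if (doc == target) {
                return boost;
            }
            if (doc > target) {
                break; // doc IDs are sorted, so the target cannot appear later
            }
        }
        return -1;
    }

    int sizeInBytes() {
        return data.length;
    }
}
```

Dropping the boost byte from each entry would give format a. A real
implementation would also want skip data so that it doesn't have to decode
every entry; this sketch leaves that out.
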
So the answer to the question of when to use a stored field and when to use
a payload should be: use payloads when you access the data during query
evaluation/scoring; use stored fields when you need the data to construct
a search result from a hit.

> fields, right? Perhaps in this way, if users were willing to commit to
> fixed length fields for the first level, we could also make field
> updating of these types of fields possible w/o having to reindex?????
>

Yes, I was thinking the same. Just like norms.
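
To illustrate why fixed-length values make in-place updates easy, here is
a small sketch in plain Java. It is not Lucene code: the class
FixedWidthValues and its methods are hypothetical, and a byte array stands
in for the file on disk. Because every document's value has the same
width, the value for a document lives at a known offset (docId * width)
and can be overwritten without touching anything else.

```java
// Hypothetical sketch of fixed-width per-document values; not Lucene code.
final class FixedWidthValues {
    private final byte[] values; // stands in for a file: one slot per document
    private final int width;     // bytes per document, fixed for all documents

    FixedWidthValues(int maxDoc, int width) {
        this.values = new byte[maxDoc * width];
        this.width = width;
    }

    // Overwrites one document's slot in place; no other slot is moved or rewritten.
    void set(int docId, byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("value must be exactly " + width + " bytes");
        }
        System.arraycopy(value, 0, values, docId * width, width);
    }

    // Returns a copy of the document's slot.
    byte[] get(int docId) {
        byte[] out = new byte[width];
        System.arraycopy(values, docId * width, out, 0, width);
        return out;
    }
}
```

With variable-length values, a longer replacement would not fit in the old
slot and everything after it would have to be shifted or rewritten, which
is why committing to a fixed length matters here.
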