> No, actually, you can update index-only fields also. It all depends on the > operation that you ask to do on the fields.
I love the level of control this provides, but again, I was talking at the user level. > If you want to e.g. remove an entire field w/ such update operation, then it > becomes more expensive That's the typical usage scenario, I imagine. When I update a field, I want to update all of it, not just part of it. No? (The lower level semantics of twiddling with the postings is poorly understood by the typical user..) > We didn't > face such a scenario though, and I think it's probably a rare one? As rare as any time you want to update an indexed-only field. Not a serious limitation (but perhaps worth noting?) Perhaps at the API level, you provide an updateIndexedOnlyField that takes the old value as well as the new value for the field. Anyway, I think your approach would be a great addition to the toolkit. Would love to see it even in rough cut form :)) -Babak On Sun, May 9, 2010 at 12:49 AM, Shai Erera <ser...@gmail.com> wrote: > No, actually, you can update index-only fields also. It all depends on the > operation that you ask to do on the fields. For example, if the query to > execute is something like "update all documents w/ tags:ibm -> remove terms > t1, t2, t3 and add terms t4, t5", then the result of such request would > dissociate t1-3 from those docs that answer tags:ibm and associate them w/ > t4 and t5. Specifically, if docs 1, 4, 10 answer tags:ibm, then the > following posting updates will be done: > t1: -1, -4, -10 > t2: -1, -4, -10 > t3: -1, -4, -10 > t4, 1, 4, 10 > t5, 1, 4, 10 > (in addition to whatever other updates that are associated with those > postings). > > At search time, if you search for "t1 OR t2", then the regular t1 and t2 > postings will be merged on-the-fly w/ the updated ones to remove docs 1, 4, > 10. > > If you want to e.g. remove an entire field w/ such update operation, then it > becomes more expensive, but in general you'd need to iterate over the > field's terms and dissociate the documents from all the terms. We didn't > face such a scenario though, and I think it's probably a rare one? > > Shai > > On Sun, May 9, 2010 at 9:39 AM, Babak Farhang <farh...@gmail.com> wrote: >> >> Shai, >> >> I think this is an interesting approach. I can see how I could >> [incrementally] update a stored, indexed field this way, but I don't >> understand how I could update an indexed-only field. Here's why: for a >> stored (and indexed) field, I can always determine what terms to >> remove ('-') from the postings, but for an indexed-only field I'd have >> no [practical] way to know.. >> >> So under this approach, I'm thinking at a user level, a Lucene field >> would be updateable only if it's at least stored. >> >> Is that right? >> >> -Babak >> >> On Wed, May 5, 2010 at 11:17 AM, Shai Erera <ser...@gmail.com> wrote: >> > Yes Mike - I don't know yet if two MPs will be used, one for the stacked >> > segments and one for the general segments (which will include the >> > stacked >> > ones as well) .. feels like one MP should be enough, but this can be >> > decided >> > on as we progress. >> > >> > This approach allows you to update every term in every already indexed >> > field, as well as add terms to already indexed fields ... and add >> > totally >> > new fields, with lots of text in them. So e.g. there are two neat use >> > cases >> > that come to mind: >> > 1) Allow users to rate search results, favor them etc. >> > 2) Or even comment them, >> > I think Google offers the 2nd. Both translate into updating the search >> > result's already indexed document w/ the new rating, comment etc. w/o >> > needing to reindex the document. >> > >> > I will try to find perf results numbers. It's been long time ago, hope >> > all >> > the documents are still where they were :). Indeed, for terms like ACLs, >> > it >> > means that each query had to merge dozens of postings lists. For that I >> > implemented an alternative solution, which uses a payload-like structure >> > that registers for each document the list of ACLs that are associated >> > with >> > it (not as strings, it was more efficient). Then, if the query included >> > dozens of such terms, I created a filter-like object which for every >> > matching document by the query checked if it matches the ACLs list of >> > the >> > document. This is usually slower, because the ACLs themselves don't >> > drive >> > the query, which means more matches will be found. That was a tradeoff >> > which >> > one could configure based on the number of such terms in the query, the >> > number of updated terms etc. >> > >> > But in essence you're right - if the solution is generic to cover any >> > term, >> > we cannot use such payload-based feature. We might need to merge the >> > stacked >> > segments more frequently. This is pending perf testing though. >> > >> > Shai >> > >> > On Wed, May 5, 2010 at 6:54 PM, Michael McCandless >> > <luc...@mikemccandless.com> wrote: >> >> >> >> Catching up here :) >> >> >> >> This is great stuff Shai -- I like the notion of "negative" postings, >> >> that "subtract" docs from previous generations as you iterate them. >> >> >> >> And I like the term "stacked segments". This fits very well with >> >> Lucene's write-once approach, ie, a writer can at any time stack a new >> >> segment (writes to new files) "over" an old segment, like the layers >> >> in photoshop. A reader merges all stacks on-the-fly when reading. >> >> >> >> And the merge policy now picks from 2 dimensions right? Ie it may >> >> want to simply consolidate stacks on an old segment but otherwise not >> >> merge that segment (eg for very large segments that have accumulated >> >> updates), and normal merging would of course consolidate all stacks >> >> for all segments merged. >> >> >> >> Wouldn't this approach conceivably allow you to alter single terms >> >> within a single field (we'd have to figure out how we'd expose the API >> >> for this)? EG if I appended some text to an already-indexed field? >> >> (In addition to adding a new field to an already indexed doc, or >> >> replacing an indexed field on a previously indexed doc). >> >> >> >> Did you have any hard perf numbers? Merge sorting N streams is >> >> surprisingly costly... we may need/want to have a reader pre-merge >> >> (using up RAM) any "long tail" of stacked segments as long as they are >> >> small enough... >> >> >> >> This sounds great!! >> >> >> >> Mike >> >> >> >> On Sun, Apr 25, 2010 at 7:32 AM, Shai Erera <ser...@gmail.com> wrote: >> >> > Hi, >> >> > >> >> > WARNING: following email is a bit long, but I think is worth the >> >> > reading >> >> > :) >> >> > >> >> > I would like to describe my implementation of incremental field >> >> > updates >> >> > in Juru (the former search library I've worked on in IBM). I will >> >> > later >> >> > discuss how I think it can be implemented in Lucene. >> >> > >> >> > The motivation/requirement came from a doc management system which >> >> > used >> >> > Juru as its search component. The system included document libraries >> >> > where users could create files and upload documents. A user could >> >> > belong >> >> > to any number of libraries and complex ACLs model was used (down to >> >> > the >> >> > level of the file). ACLs and Folders were modeled as categories in >> >> > the >> >> > index (boolean-like terms). Files and folders could be moved around >> >> > and >> >> > access to a library/folder/file could be granted/revoked at any given >> >> > time. Therefore, such updates usually affected hundreds (and >> >> > thousands) >> >> > of documents. Overall, the index managed millions of documents, tens >> >> > of >> >> > thousands of libraries and hundreds of thousands of ACLs (large >> >> > organization :)). To get a rough understanding on the number of such >> >> > updates - every several minutes, tens of thousands of documents were >> >> > updated due to such changes only (in addition to the regular content >> >> > updates). >> >> > >> >> > We were asked to support requests in the following form: "update all >> >> > docs >> >> > that match <criteria> --> do <operation>" where: >> >> > * <criteria> was "a single doc", "docs belonging to a category" and >> >> > "docs >> >> > belonging to a set of categories". >> >> > * <operation> was "add categories NEW" (NEW might not even exist in >> >> > the >> >> > index yet, or already associated w/ the document), "remove categories >> >> > OLD" >> >> > (irregardless if the docs were already associated w/ OLD or not) and >> >> > "remove all OLD and add all NEW". >> >> > >> >> > The solution I've implemented to support these requests turned out to >> >> > actually allow you to update every term (!) in the index: suppose >> >> > that >> >> > you have a table, which recorded tuples like <docid, term, +/->. The >> >> > record <1, "ibm", '+'> means that doc 1 is associated w/ the term >> >> > "ibm", >> >> > and the record <1, "hp", '-'> means that the same document is not >> >> > associated w/ the word "hp". Then, you could very easily ask for all >> >> > documents that are assoicated w/ "hp", and the result would not >> >> > include >> >> > doc 1. Note that docid=1 is not the app Doc_ID, but the internal id >> >> > the >> >> > document received. >> >> > >> >> > I've kept two types of postings in the index: regular and updates. >> >> > Taking the above examples, "ibm" regular posting looked like <"ibm", >> >> > 1, >> >> > 3, 1, 2, 5 ...> (dgaps) and the updates posting looked like <"ibm", >> >> > +2, >> >> > -3, +6, +10 ...> (absolute docid value, w/ a +/- sign). Similarly for >> >> > "hp". >> >> > >> >> > During search time, when a query with the word "ibm" was submitted, I >> >> > create a virtual posting which reads from both the regular and the >> >> > updates, and merges them on the fly according to the +/- signs. Since >> >> > both postings are sorted in ascending order, the merge is very >> >> > efficient, and query time is hardly affected. >> >> > >> >> > Those postings are merged from time to time in a process that is >> >> > similar >> >> > to how Lucene works today, which keeps the update postings relatively >> >> > small and manageable. >> >> > >> >> > Now here comes the fun part - how I think it can be implemented in >> >> > Lucene ! >> >> > >> >> > To be honest, this sat on my TODO list for a very long time and only >> >> > a >> >> > couple of months ago I figured out how to implement it in Lucene. The >> >> > main difficulty I had was around the difference between the >> >> > write-once >> >> > policy in Juru and Lucene - in Lucene, once a segment is written, it >> >> > cannot be changed. BUT, I've only recently realized that this isn't >> >> > exactly true, because deleted docs do change existing segments. The >> >> > deletes are kept in a separate file to the segment (.del) and have >> >> > their >> >> > own generation. Deletes, as I understood then, and Grant helped me >> >> > term >> >> > them better, can be defined as "Stacked Segments" - they add data to >> >> > a >> >> > segment, which from time to time are integrated into the segment >> >> > (unlike >> >> > Photoshop Layers, but my understanding of Photoshop is limited). And >> >> > the >> >> > Lucene engine knows how to combine the two, giving precedence to the >> >> > deletes. >> >> > >> >> > By introducing an "Updates Stacked Segment", we can encode postings >> >> > w/ >> >> > the '+'/'-' signs, and when TermDocs/Positions is requested, we can >> >> > create a variation which merges the two lists. When segments are >> >> > merged, >> >> > the updates will be merged into the regular postings (just like >> >> > deletes) >> >> > and thus will be gone. In addition, this plays very nicely with >> >> > readers >> >> > that are currently reading the index, as well as we can have >> >> > generations >> >> > for the updates - really like deletes ! >> >> > >> >> > I think that Lucene's architecture allows for such a solution very >> >> > cleanly and nicely (and I believe flex makes it even easier). We can >> >> > (later, after you've digested the idea) discuss whether this should >> >> > be >> >> > built into the current IW, or an extension like UpdateableIW. The API >> >> > I've been thinking about should really be like deletes, allowing to >> >> > update docs based on Term or Query. I defer the API discussion for >> >> > later >> >> > for now. >> >> > >> >> > As for stored fields, this was a real challenge to support in Juru, >> >> > but >> >> > I think that w/ "Stacked Segments" and Lucene's architecture, this >> >> > should >> >> > be much easier - adding stacked stored fields ... >> >> > >> >> > As you've noticed, the update postings are not DGap encoded, and sign >> >> > needs to be preserved. While I haven't implemented it in Juru, I >> >> > think >> >> > that perhaps this can be improved by keeping the '-' and '+' lists >> >> > separated. We will need to register somewhere which came before which >> >> > because order matters a lot here (and I'm not talking about >> >> > concurrency >> >> > - simple update instructions order). I have some idea how this can be >> >> > achieved, but I refrain from describing it now, to not make this >> >> > email >> >> > even longer :). >> >> > >> >> > I've mentioned that this approach can be applied to any term and not >> >> > just categories under some circumstances. Basically, as soon as you >> >> > update a term, its DF is no longer true, unless you are able to take >> >> > the >> >> > updates into account. We can defer the discussion on that, but >> >> > clearly >> >> > for many fields, incrementally update them should not affect >> >> > precision, >> >> > as they're not used for that type of scoring ... Maybe, by keeping >> >> > separate '+' and '-' lists we can compute statistics precisely. And I >> >> > haven't given much thought yet to how this and Mike's flex scoring >> >> > will >> >> > be integrated. >> >> > >> >> > BTW, a word on Parallel Indexing - the two are completely orthogonal. >> >> > Once PI is introduced, one can index all the updateable fields in a >> >> > dedicated slice, for perhaps improving search performance for slices >> >> > that are not updateable (not involving code which attempts to read >> >> > and >> >> > merge update and regular lists on the fly). Also, incremental field >> >> > updates support all of PI's scenarios, even though some will be done >> >> > more efficiently w/ PI. But this too is a matter for a separate >> >> > discussion :). >> >> > >> >> > That's it ! I believe I've given you all the details I have about the >> >> > approach and high level proposed solution for Lucene. Perhaps some >> >> > details slipped my mind, but if you ask the right questions, I'm sure >> >> > I'll be able to answer them :). I would like to emphasize that since >> >> > this was already implemented (in Juru) - this is more than just a "I >> >> > think this approach can work" proposal ... >> >> > >> >> > I would appreciate your comments on this. I would like to start >> >> > implementing it soon, and so as a first step, please share your >> >> > comments >> >> > on the overall approach. I'll then write a more detailed description >> >> > on >> >> > how I think to impl it in Lucene (been spending some time on that), >> >> > and >> >> > we can have more detailed (and fun) discussions on the low level >> >> > details. >> >> > >> >> > Shai >> >> > >> >> > On Fri, Apr 9, 2010 at 5:05 AM, Babak Farhang <farh...@gmail.com> >> >> > wrote: >> >> >> >> >> >> Good point. I meant the model at the document level: i.e. what >> >> >> milestones does a document go through in its life cycle. Today: >> >> >> >> >> >> created --> deleted >> >> >> >> >> >> With incremental updates: >> >> >> >> >> >> created --> update1 --> update2 --> deleted >> >> >> >> >> >> I think what I'm trying to say is that this second threaded sequence >> >> >> of state changes seems intuitively more fragile under concurrent >> >> >> scenarios. So for example, in a lock-free design, the system would >> >> >> also have to anticipate the following sequence of events: >> >> >> >> >> >> created --> update1 --> deleted --> update2 >> >> >> >> >> >> and consider update2 a null op. I'm imagining there are other cases >> >> >> that I can't think of.. >> >> >> >> >> >> -Babak >> >> >> >> >> >> >> >> >> On Tue, Apr 6, 2010 at 3:40 AM, Michael McCandless >> >> >> <luc...@mikemccandless.com> wrote: >> >> >> > write once, plus the option to the app to keep multiple commit >> >> >> > points >> >> >> > around (by customizing the deletion policy). >> >> >> > >> >> >> > Actually order of operations / commits very much matters in Lucene >> >> >> > today. >> >> >> > >> >> >> > Deletions are not idempotent: if you add a doc w/ term X, delete >> >> >> > by >> >> >> > term X, add a new doc with term X... that's very different than if >> >> >> > you >> >> >> > moved the delete op to the end. Ie the deletion only applies to >> >> >> > the >> >> >> > docs added before it. >> >> >> > >> >> >> > Mike >> >> >> > >> >> >> > On Mon, Apr 5, 2010 at 12:45 AM, Babak Farhang <farh...@gmail.com> >> >> >> > wrote: >> >> >> >> Sure. Because of the write once principle. But at some cost >> >> >> >> (duplicated data). I was just agreeing that it would not be a >> >> >> >> good >> >> >> >> idea to bake in version-ing by keeping the layers around forever >> >> >> >> in >> >> >> >> a >> >> >> >> merged index; I wasn't keying in on transactions per se. >> >> >> >> >> >> >> >> Speaking of transactions: I'm not sure if we should worry about >> >> >> >> this >> >> >> >> much yet, but with "updates" the order of the transaction commits >> >> >> >> seems important. I think commit order is less important today in >> >> >> >> Lucene because its model supports only 2 types of events: >> >> >> >> document >> >> >> >> creation--which only happens once, and document deletion, which >> >> >> >> is >> >> >> >> idempotent. What do you think? Will commits have to be ordered >> >> >> >> if >> >> >> >> we >> >> >> >> introduce updates? Or does the onus of maintaining order fall on >> >> >> >> the >> >> >> >> application? >> >> >> >> >> >> >> >> -Babak >> >> >> >> >> >> >> >> On Sat, Apr 3, 2010 at 3:28 AM, Michael McCandless >> >> >> >> <luc...@mikemccandless.com> wrote: >> >> >> >>> On Sat, Apr 3, 2010 at 1:25 AM, Babak Farhang >> >> >> >>> <farh...@gmail.com> >> >> >> >>> wrote: >> >> >> >>>>> I think they get merged in by the merger, ideally in the >> >> >> >>>>> background. >> >> >> >>>> >> >> >> >>>> That sounds sensible. (In other words, we wont concern >> >> >> >>>> ourselves >> >> >> >>>> with >> >> >> >>>> roll backs--something possible while a "layer" is still >> >> >> >>>> around.) >> >> >> >>> >> >> >> >>> Actually roll backs would still be very possible even if layers >> >> >> >>> are >> >> >> >>> merged. >> >> >> >>> >> >> >> >>> Ie, one could keep multiple commits around, and the older >> >> >> >>> commits >> >> >> >>> would still be referring to the old postings + layers, keeping >> >> >> >>> them >> >> >> >>> alive. >> >> >> >>> >> >> >> >>> Lucene would still be transactional with such an approach. >> >> >> >>> >> >> >> >>> Mike >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> --------------------------------------------------------------------- >> >> >> >>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> >> >> >>> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >> >>> >> >> >> >>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> >> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >> >> >> >> >> >> >> >> >> > >> >> >> > >> >> >> > --------------------------------------------------------------------- >> >> >> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> >> >> > For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >> > >> >> >> > >> >> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> >> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> >> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >> >> >> > >> >> > >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> > >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org