Done. I missed one more point
> > Also, it would be awesome if Blur supports a per-row auto-complete feature.

> Not sure what you mean. Are you talking about in the shell?

I was referring to auto-complete/fill during search, like in Google/Gmail, but in our case we may need to tailor it per-row instead of using global suggestions. It would have to be exposed as a thrift API.

-- Ravi

On Fri, Oct 11, 2013 at 6:21 PM, Aaron McCurry <[email protected]> wrote:

> I think that's a good idea. I like the plan to make it an option. Could you go to issue https://issues.apache.org/jira/browse/BLUR-220 and either link to this thread or add a comment to the issue with your thoughts? Thanks!
>
> Aaron
>
> On Fri, Oct 11, 2013 at 2:47 AM, Ravikumar Govindarajan <[email protected]> wrote:
>
> > Ah, that explains it, I guess. This block indexing of all records of a row should be an option. It will have big costs for online indexing.
> >
> > Let's take the case of Gmail itself. A user will have hundreds of thousands of e-mails, and every day 10-15 mails, at different time intervals, will be added to the corpus.
> >
> > Scattering records across segments and taking a minor hit during search would be the preferred choice, right?
> >
> > As compensation, we can use a SortingMergePolicy as documented at https://issues.apache.org/jira/browse/LUCENE-4752
> >
> > We can co-locate all records of a given row during merge across the participating segments. This will offset the performance loss to a good extent.
> >
> > What do you think?
> >
> > -- Ravi
> >
> > On Fri, Oct 11, 2013 at 6:12 AM, Aaron McCurry <[email protected]> wrote:
> >
> > > On Thu, Oct 10, 2013 at 6:47 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > >
> > > > I saw this JIRA on humongous rows and got quite confused about the UPDATE_ROW operation.
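[Editor's note: the SortingMergePolicy idea above (LUCENE-4752) can be pictured with a toy model. This is plain Python, not Lucene's actual API: records of a row that arrived in different segments are re-ordered by row id during the merge, so the merged segment stores each row as one contiguous block.]

```python
# Toy model of merge-time co-location in the spirit of LUCENE-4752's
# SortingMergePolicy. A segment is a list of (row_id, record_id) pairs
# in arrival order; the merge concatenates the segments and re-sorts by
# row_id (stable sort, so per-row record order is preserved).

def sorting_merge(segments):
    merged = [rec for seg in segments for rec in seg]
    merged.sort(key=lambda rec: rec[0])  # stable: co-locates each row
    return merged

# Records of row "r1" arrived in two different segments...
seg_a = [("r1", 1), ("r2", 1)]
seg_b = [("r1", 2), ("r3", 1)]

merged = sorting_merge([seg_a, seg_b])
# ...but are adjacent after the merge:
assert merged == [("r1", 1), ("r1", 2), ("r2", 1), ("r3", 1)]
```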
> > > > https://issues.apache.org/jira/browse/BLUR-220
> > > >
> > > > Let's say I add 2 records to a row whose existing records number in the hundreds of thousands.
> > > >
> > > > Will Blur attempt to first read all these records before adding the incoming 2 records?
> > >
> > > It has to right now.
> > >
> > > > What if we just expose simple record-add/delete on a row, without fetching the row at all?
> > >
> > > The problem is that the internal query class is built to only support records (documents) that are indexed together as a single block, within a single segment. It is very performant for reads and searches, but as the row grows in size it becomes very costly.
> > >
> > > One idea I had was to detect when rows are hot (being updated a lot) or too large, and move them into their own indexes. For the hot rows, once they cool off they could be merged back in with the regular rows in the main index.
> > >
> > > > It should be quite quick and highly useful, at least for apps already using Lucene.
> > >
> > > Agreed, that's what that issue is meant to solve.
> > >
> > > > -- Ravi
> > > >
> > > > On Wed, Oct 9, 2013 at 11:27 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > >
> > > > > Yes, I think bringing a mutable file into the Lucene index brings its own set of problems to handle. Filters, caches, scoring, snapshots/commits, etc. will all be affected.
> > > > >
> > > > > There is one JIRA on writing generations of updatable files, just like doc-deletes, instead of over-writing a single file [https://issues.apache.org/jira/browse/LUCENE-4258]. But that is still in progress and, from what I understand, it could slow searches considerably.
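[Editor's note: the hot-row detection Aaron describes could look roughly like the sketch below. Names are hypothetical, not Blur code: count mutations per row and flag rows whose update counts cross a threshold as candidates for moving to their own index.]

```python
from collections import Counter

class HotRowTracker:
    """Flags rows that receive many updates, as candidates for moving
    into their own index. Hypothetical sketch, not actual Blur code."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.updates = Counter()

    def record_update(self, row_id):
        self.updates[row_id] += 1

    def hot_rows(self):
        return {r for r, n in self.updates.items() if n >= self.threshold}

tracker = HotRowTracker(threshold=3)
for row in ["r1", "r2", "r1", "r1", "r3"]:
    tracker.record_update(row)
assert tracker.hot_rows() == {"r1"}
```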
> > > > > BTW, is it possible to extend BlurPartitioner and load it during start-up?
> > > > >
> > > > > Also, it would be awesome if Blur supports a per-row auto-complete feature.
> > > > >
> > > > > -- Ravi
> > > > >
> > > > > On Sat, Oct 5, 2013 at 2:01 AM, Aaron McCurry <[email protected]> wrote:
> > > > >
> > > > > > I have thought of one possible problem with this approach. To date, the mindset I have used in all of the Blur internals is that segments are immutable. This is a fundamental principle that Blur uses, and I don't really have any ideas on where to begin checking for where this would be a problem. I know filters are going to be an issue; not sure where else.
> > > > > >
> > > > > > Not saying that it can't be done, it's just not going to be as clean as I originally thought.
> > > > > >
> > > > > > Aaron
> > > > > >
> > > > > > On Fri, Oct 4, 2013 at 4:26 PM, Aaron McCurry <[email protected]> wrote:
> > > > > >
> > > > > > > On Fri, Oct 4, 2013 at 7:15 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > >
> > > > > > > > On a related note, do you think such an approach would fit in Blur:
> > > > > > > >
> > > > > > > > 1. Store the BDB file in the shard-server itself.
> > > > > > >
> > > > > > > Probably not; this would pin the BDB (or whatever the solution would be) to a specific server. We will have to sync to HDFS.
> > > > > > >
> > > > > > > > 2. Apply all incoming partial doc-updates to the local BDB file as well as an update-transaction log.
> > > > > > >
> > > > > > > Blur already has a write-ahead log as part of its internals. It's written and synced to HDFS.
> > > > > > >
> > > > > > > > 3.
> > > > > > > > Periodically sync dirty BDB files to HDFS and roll over the update-transaction log.
> > > > > > > >
> > > > > > > > Whenever a shard-server goes down, the take-over server can initially sync the BDB file from HDFS to local, replay the update-transaction log, and then start serving data.
> > > > > > >
> > > > > > > Blur already does this internally; it records the mutates and replays them if a failure happens before a commit.
> > > > > > >
> > > > > > > Aaron
> > > > > > >
> > > > > > > > -- Ravi
> > > > > > > >
> > > > > > > > On Thu, Oct 3, 2013 at 11:14 PM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > The mutate APIs are a good fit for individual column updates. A BlurCodec would be cool and solve a lot of problems.
> > > > > > > > >
> > > > > > > > > There are 3 caveats for such a codec:
> > > > > > > > >
> > > > > > > > > 1. Scores for affected queries will be wrong until segment merge.
> > > > > > > > >
> > > > > > > > > 2. Responsibility for ordering updates must be on the client.
> > > > > > > > >
> > > > > > > > > 3. Repeated updates to the same document can either take a generational approach [LUCENE-4258] or use a single version of storage [Redis/TC etc.], pushing the onus to the client, depending on how the codec shapes up.
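[Editor's note: the failover sequence sketched in steps 1-3 above (sync the snapshot, replay the log, then serve) boils down to the following. This is a generic model of the scheme under discussion, not Blur's actual WAL implementation.]

```python
def recover(snapshot, txn_log):
    """Rebuild state on the take-over server: start from the last
    snapshot synced to durable storage (HDFS in the thread), then
    re-apply every entry from the update-transaction log in order."""
    state = dict(snapshot)
    for key, value in txn_log:
        state[key] = value
    return state

snapshot = {"doc1": {"flag": 0}, "doc2": {"flag": 0}}
txn_log = [("doc1", {"flag": 1}), ("doc3", {"flag": 1})]

state = recover(snapshot, txn_log)
assert state == {"doc1": {"flag": 1}, "doc2": {"flag": 0}, "doc3": {"flag": 1}}
```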
> > > > > > > > > The former will be semantically correct but really sluggish, while the latter will be faster during search.
> > > > > > > > >
> > > > > > > > > On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Yeah, you are correct. A BDB file will probably never be ported to HDFS.
> > > > > > > > > > >
> > > > > > > > > > > Our daily update frequency comes to about 20% of the insertion rate.
> > > > > > > > > > >
> > > > > > > > > > > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
> > > > > > > > > > >
> > > > > > > > > > > This update could potentially span tens of thousands of SQL rows in our case, where COL2 is just a boolean flip.
> > > > > > > > > > >
> > > > > > > > > > > The problem is not with Lucene's ability to handle the load. Instead, it is the consistent load it puts on our content servers to read and re-tokenize such huge rows just for a boolean flip. Another big winner is that none of our updatable fields are involved in scoring at all; just matching will do.
> > > > > > > > > > >
> > > > > > > > > > > The changes also sit in BDB only till the next segment merge, after which they are cleaned out. There is very little perf hit here for us, as users don't immediately search after a change.
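[Editor's note: the scheme Ravi describes — partial updates parked in a side store until the next segment merge folds them in — can be modeled as an overlay consulted before the immutable segment data. All names below are invented for illustration; this is not the proprietary codec.]

```python
class PatchedReader:
    """Read path for the side-store scheme described above: partial
    updates live in an overlay (the BDB/TokyoCabinet file in the
    thread) until the next segment merge folds them into the index
    and clears the overlay. Rough model only."""

    def __init__(self, segment):
        self.segment = dict(segment)   # immutable indexed baseline
        self.overlay = {}              # side store of partial updates

    def update(self, doc, field, value):
        # A boolean flip lands here; no re-read/re-tokenize of the row.
        self.overlay.setdefault(doc, {})[field] = value

    def get(self, doc, field):
        patch = self.overlay.get(doc, {})
        if field in patch:
            return patch[field]
        return self.segment[doc][field]

    def merge(self):
        # Segment merge folds patches into the index and cleans the store.
        for doc, patch in self.overlay.items():
            self.segment[doc].update(patch)
        self.overlay.clear()

r = PatchedReader({"d1": {"col1": "X", "col2": 0}})
r.update("d1", "col2", 1)            # the boolean flip
assert r.get("d1", "col2") == 1
r.merge()
assert r.overlay == {} and r.get("d1", "col2") == 1
```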
> > > > > > > > > > > I am afraid there is no documentation/code/numbers on this currently in public, as it is still proprietary, but it is remarkably similar to the popular RedisCodec.
> > > > > > > > > > >
> > > > > > > > > > > "If you really need partial document updates, there would need to be changes throughout the entire stack"
> > > > > > > > > > >
> > > > > > > > > > > You mean the entire stack of Blur? In case this is possible, can you give me a 10,000-ft overview of what you have in mind?
> > > > > > > > > >
> > > > > > > > > > Interesting, now that I think about it. The situation that you describe is very interesting. I'm wondering, if we came up with something like this in Blur, whether it would fix our large-Row issue, or at the very least help the problem.
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/BLUR-220
> > > > > > > > > >
> > > > > > > > > > Plus, the more I think about it, the mutate methods are probably the right implementation for modifying single columns. So the API of Blur probably wouldn't need to be changed, maybe just the way it goes about dealing with changes. I'm thinking maybe we need our own BlurCodec to handle large Rows as well as Record (Document) updates.
> > > > > > > > > >
> > > > > > > > > > As an aside, I constantly have to refer to Records as Documents; this is why I think we need a rename.
> > > > > > > > > > Aaron
> > > > > > > > > >
> > > > > > > > > > > -- Ravi
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > The biggest issue with this is that the shards (the indexes) inside of Blur actually move from one server to another. So to support this behavior, all the indexes are stored in HDFS. Due to the differences between HDFS and a normal POSIX file system, I highly doubt that the BDB file format in TokyoCabinet can ever be supported.
> > > > > > > > > > > >
> > > > > > > > > > > > If you really need partial document updates, there would need to be changes throughout the entire stack. I am curious why you need this feature. Do you have that many updates to the index? What is the update frequency? I'm just curious what kind of performance you get out of a setup like that, since I haven't ever run such a setup and have no idea how to compare that kind of system to a base Lucene setup.
> > > > > > > > > > > >
> > > > > > > > > > > > Could you point me to some code or documentation? I would like to go and take a look.
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Aaron
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > One more question.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We also maintain a file named "BDB", just like the "Sample" file Blur uses for tracing.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This "BDB" file pertains to TokyoCabinet and is used purely for supporting partial updates to a document. All operations on this file rely on local file-paths only, through the use of native code. Currently, all update requests are local to the index files, and it becomes trivial to support.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Any pointers on how to take this forward in the Blur set-up of shard-servers & controllers?
> > > > > > > > > > > > > -- Ravi
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > You can control the fields to warm up via:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The preCacheCols field. The comment is wrong, however, so I will create a task to correct it. The format of the field is "family.column", just like you would search.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Aaron
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks, Aaron.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > General sampling and warming is fine, and the code is really concise and clear.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The act of reading brings the data into the block cache and the result is that the index is "hot".
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Will all the terms of a field be read and brought into the cache?
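[Editor's note: for concreteness, configuring the warmed columns might look like the fragment below. This is a sketch only: it assumes a generated client binding, and the class and field names should be verified against the TableDescriptor docs linked above. Only `preCacheCols` and the "family.column" format come from the thread; the table and column names are invented.]

```python
# Hypothetical sketch of setting preCacheCols on a table descriptor.
# Verify the class/field names against the linked Blur 0.2.0 docs
# before use; "email_table", "body.subject", "body.content" are
# made-up example values.
table = TableDescriptor()
table.name = "email_table"
table.preCacheCols = ["body.subject",   # "family.column" format,
                      "body.content"]   # just like you would search
```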
> > > > > > > > > > > > > > > If so, then it has an obvious implication: avoid warming up fields like, say, attachment-data, provided queries don't often include such fields.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Take a look at this package:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Basically, when the warmup process starts (which is asynchronous to the rest of the application), it flips a thread-local switch to allow for tracing of the file accesses. The sampler will sample each of the fields in each segment and create a sample file that attempts to detect the boundaries of each field within each file within each segment.
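[Editor's note: mechanically, the warmup pass reduces to "read every sampled byte range so the cache gets populated". A simplified stand-in is below — it counts the bytes touched instead of filling a block cache, and the sample layout is a plain dict rather than Blur's sample files.]

```python
import io

def warm_index(file_data, samples):
    """Simplified model of the warmup read pass: for each field, read
    the sampled byte range of each index file. In Blur the reads pull
    those blocks into the block cache; here we just count bytes read.
    `samples` maps field -> list of (filename, start, end)."""
    touched = {}
    for field, ranges in samples.items():
        total = 0
        for name, start, end in ranges:
            f = io.BytesIO(file_data[name])
            f.seek(start)
            total += len(f.read(end - start))
        touched[field] = total
    return touched

files = {"_0.tim": b"A" * 100, "_0.tip": b"B" * 40}
samples = {"field1": [("_0.tim", 0, 60), ("_0.tip", 0, 25)],
           "field2": [("_0.tim", 60, 100), ("_0.tip", 25, 40)]}
assert warm_index(files, samples) == {"field1": 85, "field2": 55}
```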
> > > > > > > > > > > > > > > > Then it stores the sample info into the directory beside each segment (so that it doesn't have to re-sample the segment). After the sampling is complete or loaded, the warmup just reads the binary data from each file. The act of reading brings the data into the block cache, and the result is that the index is "hot".
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hope this helps.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Aaron
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > As I understand it, Lucene will store the files in the following way per-segment:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > TIM file
> > > > > > > > > > > > > > > > > Field1 ---> some byte[]
> > > > > > > > > > > > > > > > > Field2 ---> some byte[]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > TIP file
> > > > > > > > > > > > > > > > > Field1 ---> some byte[]
> > > > > > > > > > > > > > > > > Field2 ---> some byte[]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Blur will "sample" this Lucene file in the following way:
> > > > > > > > > > > > > > > > > Field1 --> <TIM, start-offset>, <TIP, start-offset>, ...
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Field2 --> <TIM, start-offset>, <TIP, start-offset>, ...
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is my understanding correct?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How does Blur warm up the fields when it does not know the "end-offset" or the "length" for each field to warm?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Will it by default read all terms of a field?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -- Ravi
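[Editor's note: on the end-offset question — given only per-field start offsets, a field's extent within a file can be inferred to run up to the next field's start offset, or to the file's length for the last field. The sketch below shows that inference; it is deduced from the sampling description in the thread, not confirmed Blur code.]

```python
def field_ranges(starts, file_length):
    """Derive (start, end) per field within one index file from
    start-offsets alone: each field ends where the next one begins,
    and the last field runs to the end of the file.
    `starts` maps field name -> start offset in the file."""
    ordered = sorted(starts.items(), key=lambda kv: kv[1])
    ranges = {}
    for i, (field, start) in enumerate(ordered):
        end = ordered[i + 1][1] if i + 1 < len(ordered) else file_length
        ranges[field] = (start, end)
    return ranges

# Sample info for one hypothetical .tim file of length 100:
starts = {"field1": 0, "field2": 60}
assert field_ranges(starts, 100) == {"field1": (0, 60), "field2": (60, 100)}
```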

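[Editor's note: returning to the per-row auto-complete request at the top of the thread — scoping suggestions to a single row, rather than the global term dictionary, could be modeled as below. This is an illustrative sketch only; a real version would sit behind the thrift API Ravi asks for and use a per-row FST or trie rather than a linear scan.]

```python
from collections import defaultdict

class PerRowSuggester:
    """Per-row (rather than global) auto-complete: suggestions are
    scoped to the terms indexed for one row, as in suggesting only
    from a single user's mailbox. Illustrative sketch, not Blur code."""

    def __init__(self):
        self.terms = defaultdict(set)   # row_id -> indexed terms

    def index_term(self, row_id, term):
        self.terms[row_id].add(term)

    def suggest(self, row_id, prefix, limit=10):
        hits = [t for t in self.terms[row_id] if t.startswith(prefix)]
        return sorted(hits)[:limit]

s = PerRowSuggester()
for term in ["meeting", "merge", "mail"]:
    s.index_term("user1", term)
s.index_term("user2", "metrics")
assert s.suggest("user1", "me") == ["meeting", "merge"]   # scoped to user1
assert s.suggest("user2", "me") == ["metrics"]            # scoped to user2
```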