On Wed, Oct 9, 2013 at 1:57 AM, Ravikumar Govindarajan < [email protected]> wrote:
> Yes, I think bringing a mutable file into the lucene-index brings its own
> set of problems to handle. Filters, Caches, Scoring, Snapshots/Commits etc.
> will all be affected.
>
> There is a JIRA on writing generations of updatable files, just like
> doc-deletes, instead of overwriting a single file
> [https://issues.apache.org/jira/browse/LUCENE-4258]. But that is still
> in progress and, from what I understand, it could slow searches
> considerably.

I think that we can add whatever API extension points are necessary to let
users modify anything. However, the more I think about it, the more I think
that there are too many rules built on immutability: block cache, HDFS,
filters, search cache, etc.

> BTW, is it possible to extend BlurPartitioner and load it during start-up?

Not yet, but anything is just a patch away. :-)

> Also, it would be awesome if Blur supports a per-row auto-complete feature.

Not sure what you mean. Are you talking about in the shell?

Aaron

> --
> Ravi
>
>
> On Sat, Oct 5, 2013 at 2:01 AM, Aaron McCurry <[email protected]> wrote:
>
> > I have thought of one possible problem with this approach. To date the
> > mindset I have used in all of the Blur internals is that segments are
> > immutable. This is a fundamental principle that Blur uses and I don't
> > really have any ideas on where to begin checking for when this is a
> > problem. I know filters are going to be an issue, not sure where else.
> >
> > Not saying that it can't be done, it's just not going to be as clean as I
> > originally thought.
> >
> > Aaron
> >
> >
> > On Fri, Oct 4, 2013 at 4:26 PM, Aaron McCurry <[email protected]> wrote:
> >
> > > On Fri, Oct 4, 2013 at 7:15 AM, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > >> On a related note, do you think such an approach will fit in Blur?
> > >>
> > >> 1. Store the BDB file in shard-server itself.
> > >>
> > >
> > > Probably not, this would pin the BDB (or whatever the solution would be)
> > > to a specific server. We will have to sync to HDFS.
> > >
> > >>
> > >> 2. Apply all incoming partial doc-updates to local BDB file as well as an
> > >> update-transaction log
> > >>
> > >
> > > Blur already has a write-ahead log as a part of its internals. It's
> > > written and synced to HDFS.
> > >
> > >>
> > >> 3. Periodically sync dirty BDB files to HDFS and roll over the
> > >> update-transaction log.
> > >>
> > >> Whenever a shard-server goes down, the take-over server can initially sync
> > >> the BDB file from HDFS to local, replay the update-transaction log and
> > >> then start serving data
> > >>
> > >
> > > Blur already does this internally: it records the mutates and replays them
> > > if a failure happens before a commit.
> > >
> > > Aaron
> > >
> > >>
> > >> --
> > >> Ravi
> > >>
> > >>
> > >> On Thu, Oct 3, 2013 at 11:14 PM, Ravikumar Govindarajan <
> > >> [email protected]> wrote:
> > >>
> > >> > The mutate APIs are a good fit for individual column updates. BlurCodec
> > >> > will be cool and solve a lot of problems.
> > >> >
> > >> > There are 3 caveats for such a codec
> > >> >
> > >> > 1. Scores for affected queries will be wrong, until segment-merge
> > >> >
> > >> > 2. Responsibility of ordering updates must be on the client.
> > >> >
> > >> > 3. Repeated updates for the same document can either take a generational
> > >> > approach [Lucene-4258] or use a single version of storage [Redis/TC
> > >> > etc.], pushing the onus to client, depending on how the Codec shapes up.
> > >> >
> > >> > The former will be semantically correct but really sluggish while the
> > >> > latter will be faster during search
> > >> >
> > >> >
> > >> > On Thu, Oct 3, 2013 at 8:53 PM, Aaron McCurry <[email protected]> wrote:
> > >> >
> > >> >> On Thu, Oct 3, 2013 at 11:08 AM, Ravikumar Govindarajan <
> > >> >> [email protected]> wrote:
> > >> >>
> > >> >> > Yeah, you are correct. A BDB file will probably never be ported to
> > >> >> > HDFS.
> > >> >> >
> > >> >> > Our daily update frequency comes to about 20% of insertion rate.
> > >> >> >
> > >> >> > Let's say "UPDATE <TABLE> SET COL2=1 WHERE COL1=X".
> > >> >> >
> > >> >> > This update could potentially span across tens of thousands of SQL
> > >> >> > rows in our case, where COL2 is just a boolean flip.
> > >> >> >
> > >> >> > The problem is not with lucene's ability to handle load. Instead it is
> > >> >> > with the consistent load it puts on our content servers to read and
> > >> >> > re-tokenize such huge rows just for a boolean flip. Another big winner
> > >> >> > is that all our updatable fields are not involved in scoring at all.
> > >> >> > Just matching will do.
> > >> >> >
> > >> >> > The changes also sit in BDB only till the next segment merge, after
> > >> >> > which they are cleaned out. There is very little perf hit here for us,
> > >> >> > as users don't immediately search after a change.
> > >> >> >
> > >> >> > I am afraid there is no documentation/code/numbers on this currently
> > >> >> > in public, as it is still proprietary, but it is remarkably similar to
> > >> >> > the popular RedisCodec.
> > >> >> >
> > >> >> > "If you really need partial document updates, there would need to be
> > >> >> > changes throughout the entire stack"
> > >> >> >
> > >> >> > You mean, the entire stack of Blur?
> > >> >> > In case this is possible, can you give me a 10000-ft overview of what
> > >> >> > you have in mind?
> > >> >> >
> > >> >>
> > >> >> Interesting, now that I think about it. The situation that you describe
> > >> >> is very interesting. I'm wondering whether, if we came up with something
> > >> >> like this in Blur, it would fix our large Row issue. Or at the very
> > >> >> least help the problem.
> > >> >>
> > >> >> https://issues.apache.org/jira/browse/BLUR-220
> > >> >>
> > >> >> Plus the more I think about it, the mutate methods are probably the
> > >> >> right implementation for modifying single columns. So the API of Blur
> > >> >> probably wouldn't need to be changed. Maybe just the way it goes about
> > >> >> dealing with changes. I'm thinking maybe we need our own BlurCodec to
> > >> >> handle large Rows as well as Record (Document) updates.
> > >> >>
> > >> >> As an aside, I constantly am having to refer to Records as Documents;
> > >> >> this is why I think we need a rename.
> > >> >>
> > >> >> Aaron
> > >> >>
> > >> >> >
> > >> >> > --
> > >> >> > Ravi
> > >> >> >
> > >> >> >
> > >> >> > On Thu, Oct 3, 2013 at 5:36 PM, Aaron McCurry <[email protected]> wrote:
> > >> >> >
> > >> >> > > The biggest issue with this is that the shards (the indexes) inside
> > >> >> > > of Blur actually move from one server to another. So to support this
> > >> >> > > behavior all the indexes are stored in HDFS. Due to the differences
> > >> >> > > between HDFS and a normal POSIX file system, I highly doubt that
> > >> >> > > the BDB file form in TokyoCabinet can ever be supported.
> > >> >> > >
> > >> >> > > If you really need partial document updates, there would need to be
> > >> >> > > changes throughout the entire stack.
> > >> >> > > I am curious why you need this feature. Do you have that many
> > >> >> > > updates to the index? What is the update frequency? I'm also
> > >> >> > > curious what kind of performance you get out of a setup like that.
> > >> >> > > Since I haven't ever run such a setup I have no idea how to compare
> > >> >> > > that kind of system to a base Lucene setup.
> > >> >> > >
> > >> >> > > Could you point me to some code or documentation? I would like to go
> > >> >> > > and take a look.
> > >> >> > >
> > >> >> > > Thanks,
> > >> >> > > Aaron
> > >> >> > >
> > >> >> > >
> > >> >> > > On Thu, Oct 3, 2013 at 7:00 AM, Ravikumar Govindarajan <
> > >> >> > > [email protected]> wrote:
> > >> >> > >
> > >> >> > > > One more request for help.
> > >> >> > > >
> > >> >> > > > We also maintain a file by name "BDB", just like the "Sample" file
> > >> >> > > > for tracing used by Blur.
> > >> >> > > >
> > >> >> > > > This "BDB" file pertains to TokyoCabinet and is used purely for
> > >> >> > > > supporting partial updates to a document.
> > >> >> > > > All operations on this file rely on local file-paths only, through
> > >> >> > > > the use of native code.
> > >> >> > > > Currently, all update requests are local to the index files and it
> > >> >> > > > becomes trivial to support.
> > >> >> > > >
> > >> >> > > > Any pointers on how to take this forward in Blur set-up of
> > >> >> > > > shard-servers & controllers?
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > > Ravi
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > On Tue, Oct 1, 2013 at 10:15 PM, Aaron McCurry <[email protected]> wrote:
> > >> >> > > >
> > >> >> > > > > You can control the fields to warmup via:
> > >> >> > > > >
> > >> >> > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_TableDescriptor
> > >> >> > > > >
> > >> >> > > > > The preCacheCols field. The comment is wrong however, so I will
> > >> >> > > > > create a task to correct. The use of the field is:
> > >> >> > > > > "family.column" just like you would search.
> > >> >> > > > >
> > >> >> > > > > Aaron
> > >> >> > > > >
> > >> >> > > > >
> > >> >> > > > > On Tue, Oct 1, 2013 at 12:41 PM, Ravikumar Govindarajan <
> > >> >> > > > > [email protected]> wrote:
> > >> >> > > > >
> > >> >> > > > > > Thanks Aaron
> > >> >> > > > > >
> > >> >> > > > > > General sampling and warming is fine and the code is really
> > >> >> > > > > > concise and clear.
> > >> >> > > > > >
> > >> >> > > > > > The act of reading
> > >> >> > > > > > brings the data into the block cache and the result is that
> > >> >> > > > > > the index is "hot".
> > >> >> > > > > >
> > >> >> > > > > > Will all the terms of a field be read and brought into the
> > >> >> > > > > > cache?
> > >> >> > > > > > If so, then it has an obvious implication to avoid fields
> > >> >> > > > > > like, say, attachment-data from warming up, provided queries
> > >> >> > > > > > don't often include such fields
> > >> >> > > > > >
> > >> >> > > > > >
> > >> >> > > > > > On Tue, Oct 1, 2013 at 7:58 PM, Aaron McCurry <[email protected]> wrote:
> > >> >> > > > > >
> > >> >> > > > > > > Take a look at this package.
> > >> >> > > > > > >
> > >> >> > > > > > > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=tree;f=blur-store/src/main/java/org/apache/blur/lucene/warmup;h=f4239b1947965dc7fe8218eaa16e3f39ecffdda0;hb=apache-blur-0.2
> > >> >> > > > > > >
> > >> >> > > > > > > Basically when the warmup process starts (which is
> > >> >> > > > > > > asynchronous to the rest of the application) it flips a
> > >> >> > > > > > > thread local switch to allow for tracing of the file
> > >> >> > > > > > > accesses. The sampler will sample each of the fields in
> > >> >> > > > > > > each segment and create a sample file that attempts to
> > >> >> > > > > > > detect the boundaries of each field within each file within
> > >> >> > > > > > > each segment. Then it stores the sample info into the
> > >> >> > > > > > > directory beside each segment (so that way it doesn't have
> > >> >> > > > > > > to re-sample the segment). After the sampling is complete
> > >> >> > > > > > > or loaded, the warmup just reads the binary data from each
> > >> >> > > > > > > file.
> > >> >> > > > > > > The act of reading brings the data into the block cache and
> > >> >> > > > > > > the result is that the index is "hot".
> > >> >> > > > > > >
> > >> >> > > > > > > Hope this helps.
> > >> >> > > > > > >
> > >> >> > > > > > > Aaron
> > >> >> > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > > > On Tue, Oct 1, 2013 at 10:09 AM, Ravikumar Govindarajan <
> > >> >> > > > > > > [email protected]> wrote:
> > >> >> > > > > > >
> > >> >> > > > > > > > As I understand,
> > >> >> > > > > > > >
> > >> >> > > > > > > > Lucene will store the files in following way per-segment
> > >> >> > > > > > > >
> > >> >> > > > > > > > TIM file
> > >> >> > > > > > > > Field1 ---> Some byte[]
> > >> >> > > > > > > > Field2 ---> Some byte[]
> > >> >> > > > > > > >
> > >> >> > > > > > > > TIP file
> > >> >> > > > > > > > Field1 ---> Some byte[]
> > >> >> > > > > > > > Field2 ---> Some byte[]
> > >> >> > > > > > > >
> > >> >> > > > > > > > Blur will "sample" this lucene-file in the following way
> > >> >> > > > > > > >
> > >> >> > > > > > > > Field1 --> <TIM, start-offset>, <TIP, start-offset>, ...
> > >> >> > > > > > > >
> > >> >> > > > > > > > Field2 --> <TIM, start-offset>, <TIP, start-offset>, ...
> > >> >> > > > > > > >
> > >> >> > > > > > > > Is my understanding correct?
> > >> >> > > > > > > >
> > >> >> > > > > > > > How does Blur warm up the fields, when it does not know
> > >> >> > > > > > > > the "end-offset" or the "length" for each field to warm?
> > >> >> > > > > > > >
> > >> >> > > > > > > > Will it by default read all Terms of a field?
> > >> >> > > > > > > >
> > >> >> > > > > > > > --
> > >> >> > > > > > > > Ravi
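
For readers following the sampling question at the bottom of the thread, here is a rough sketch of how per-field start offsets alone could be turned into byte ranges to read during warm-up. It is not Blur's actual implementation. The file names, offsets, file lengths, and the functions warm_range and warm_plan are all made up for illustration, and the rule "a field ends where the next sampled field starts, or at end of file" is an assumption rather than something stated in the thread.

```python
from bisect import bisect_right

# Hypothetical sample data: for each index file, the start offset at which
# each field's data was first observed. Names and numbers are illustrative.
SAMPLE = {
    "_0.tim": {"Field1": 0, "Field2": 4096},
    "_0.tip": {"Field1": 0, "Field2": 512},
}
# Hypothetical file lengths in bytes.
FILE_LENGTHS = {"_0.tim": 10000, "_0.tip": 900}


def warm_range(file_name, field):
    """Return (start, end) byte offsets to read for `field` in `file_name`.

    Assumption: a field's data ends at the next field's start offset, or at
    the file length when the field is the last one sampled in the file.
    """
    starts = sorted(SAMPLE[file_name].values())
    start = SAMPLE[file_name][field]
    nxt = bisect_right(starts, start)  # index of the first larger offset
    end = starts[nxt] if nxt < len(starts) else FILE_LENGTHS[file_name]
    return start, end


def warm_plan(fields):
    """List (file, start, end) ranges for the fields selected for warm-up."""
    return [(f, *warm_range(f, field))
            for f in sorted(SAMPLE)
            for field in fields
            if field in SAMPLE[f]]
```

Under these assumptions, warming only selected fields (for example those listed in preCacheCols) amounts to reading the ranges returned by warm_plan, which would leave a large field such as attachment data out of the block cache.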
