Hi Aniket,

I think if an update/delete touches only a small amount of data, then horizontal compaction can be triggered based on user configuration, but if a large amount of data is updated, it is better to start vertical compaction immediately. This is because we are not physically deleting the data from disk: if a large fraction of the data is updated (more than 60%), then during a query we first read the older data, exclude the deleted records, and then include the update delta file data. In that case more data will come into memory; we can avoid this by starting vertical compaction immediately after the update/delete.
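To make this concrete, here is a minimal sketch of the decision rule I have in mind (all names, including the 60% threshold constant, are illustrative and not from the actual CarbonData code base):

// Illustrative sketch only: encodes the decision rule described above.
public final class CompactionDecision {

    enum CompactionType { NONE, HORIZONTAL, VERTICAL }

    // Hypothetical threshold: start vertical compaction once more than
    // 60% of a segment's rows are covered by update/delete delta files.
    private static final double VERTICAL_THRESHOLD = 0.60;

    /**
     * Decides which compaction to run after an update/delete.
     *
     * @param updatedRows       rows covered by update/delete delta files
     * @param totalRows         total rows in the segment
     * @param horizontalEnabled user configuration flag for small updates
     */
    static CompactionType chooseCompaction(long updatedRows,
                                           long totalRows,
                                           boolean horizontalEnabled) {
        if (totalRows == 0) {
            return CompactionType.NONE;
        }
        double updateRatio = (double) updatedRows / totalRows;
        if (updateRatio > VERTICAL_THRESHOLD) {
            // Otherwise a query must read the base data, exclude the
            // delete delta, and include the update delta, pulling too
            // much data into memory.
            return CompactionType.VERTICAL;
        }
        return horizontalEnabled ? CompactionType.HORIZONTAL
                                 : CompactionType.NONE;
    }
}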
-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 2:43 PM, Kumar Vishal <[email protected]> wrote:

> Hi Aniket,
>
> I agree with Vimal's opinion, but that use case will be very rare.
>
> I have one query about this update and delete feature:
> when will we start compaction after each update or delete operation?
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik <[email protected]> wrote:
>
>> Hi Vimal,
>>
>> Thanks for your suggestions.
>> For the 1st point, I tend to agree with Manish's comments, but it's worth
>> looking into different ways to optimize the performance. I guess query
>> performance may take priority over update performance. Basically, we may
>> need a better compaction approach for merging delta files into regular
>> CarbonData files to maintain adequate performance.
>> For the 2nd point, CarbonData will support updating multiple rows, but
>> not updating the same row multiple times in a single update operation. It
>> is possible that the join condition in the sub-select of the original
>> update statement results in multiple rows from the source table for the
>> same row in the target table. This is an ambiguous condition, and the
>> common ways to solve it are to error out, to apply the first matching
>> row, or to apply the last matching row. CarbonData will choose to error
>> out and let the user resolve the ambiguity, which is the safer/standard
>> choice.
>>
>> Best Regards,
>> Aniket
>>
>> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <[email protected]> wrote:
>>
>>> Hi Vimal,
>>>
>>> I have a few queries regarding the 1st suggestion.
>>>
>>> 1. Dimensions can be both dictionary and no-dictionary. If we update the
>>> dictionary file, then we will have to maintain two flows, one for
>>> dictionary columns and one for no-dictionary columns. Will that be OK?
>>>
>>> 2. We write dictionary files in append mode. Updating dictionary files
>>> would mean completely rewriting the dictionary file, which would also
>>> modify the dictionary metadata and sort index files, OR is there some
>>> other approach that needs to be followed, such as maintaining an update
>>> delta mapping for the dictionary file?
>>>
>>> Regards
>>> Manish Gupta
>>>
>>> On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath
>>> <[email protected]> wrote:
>>>
>>>> Hi Aniket,
>>>>
>>>> The design looks sound and the documentation is great.
>>>> I have a few suggestions.
>>>>
>>>> 1) Measure update vs. dimension update: In the case of a dimension
>>>> update, for example, the user wants to change dept1 to dept2 for all
>>>> users who are under dept1. Can we just update the dictionary for
>>>> faster performance?
>>>> 2) Update semantics (one matching record vs. multiple matching
>>>> records): I could not understand this section. I wanted to confirm
>>>> whether we will support one update statement updating multiple rows.
>>>>
>>>> -Vimal
>>>>
>>>> On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <[email protected]> wrote:
>>>>
>>>>> Hi Aniket
>>>>>
>>>>> Thank you for finishing the good design documents. A couple of inputs
>>>>> from my side:
>>>>>
>>>>> 1. Please add the below-mentioned info (RowID definition etc.) to the
>>>>> design documents as well.
>>>>> 2. On page 6: "Schema change operation can run in parallel with
>>>>> Update or Delete operations, but not with another schema change
>>>>> operation". Can you explain this item?
>>>>> 3. Please unify the description: use "CarbonData" to replace
>>>>> "Carbon", and unify the description for "destination table" and
>>>>> "target table".
>>>>> 4. Is the Update operation's delete delta the same as the Delete
>>>>> operation's delete delta?
>>>>>
>>>>> BTW, it would be much better if you could provide Google Docs for
>>>>> review next time; it is really difficult to comment on PDF
>>>>> documents :)
>>>>>
>>>>> Regards
>>>>> Liang
>>>>>
>>>>> Aniket Adnaik wrote:
>>>>>
>>>>>> Hi Sujith,
>>>>>>
>>>>>> Please see my comments inline.
>>>>>>
>>>>>> Best Regards,
>>>>>> Aniket
>>>>>>
>>>>>> On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko
>>>>>> <sujithchacko.2010@> wrote:
>>>>>>
>>>>>>> Hi Aniket,
>>>>>>>
>>>>>>> It's a well-documented design. I just want to know a few points:
>>>>>>>
>>>>>>> a. Format of the RowID and its data type.
>>>>>>
>>>>>> AA>> The following format can be used to represent a unique RowID:
>>>>>>
>>>>>> [
>>>>>> <Segment ID>
>>>>>> <Block ID>
>>>>>> <Blocklet ID>
>>>>>> <Offset in Blocklet>
>>>>>> ]
>>>>>>
>>>>>> A simple way would be to use the String data type and store it as a
>>>>>> text file. However, a more efficient way could be to use
>>>>>> BitSets/bitmaps as a further optimization. Compressed bitmaps such
>>>>>> as Roaring bitmaps can be used for better performance and more
>>>>>> efficient storage.
>>>>>>
>>>>>>> b. Impact of this feature on select queries: since the query
>>>>>>> process has to exclude each deleted record and include the
>>>>>>> corresponding updated record every time, is any optimization being
>>>>>>> considered to tackle the query performance impact, given that one
>>>>>>> of the major highlights of CarbonData is performance?
>>>>>>
>>>>>> AA>> Some of the optimizations would be to cache the deltas to
>>>>>> avoid recurrent I/O, to store sorted RowIDs in the delete delta for
>>>>>> efficient lookup, and to perform regular compaction to minimize the
>>>>>> impact on select query performance. Additionally, we may have to
>>>>>> explore ways to perform compaction automatically, for example, when
>>>>>> more than 25% of rows are read from deltas. Please feel free to
>>>>>> share if you have any ideas or suggestions.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Sujith
>>>>>>>
>>>>>>> On Nov 20, 2016 9:24 PM, "Aniket Adnaik" <aniket.adnaik@> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Please find a design doc for Update/Delete support in CarbonData:
>>>>>>>>
>>>>>>>> https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?usp=sharing
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Aniket
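PS: To make the RowID discussion above concrete, here is a rough sketch of the two representations Aniket mentioned, a string-encoded RowID and a compressed bitmap of deleted offsets per blocklet (using the open-source org.roaringbitmap library). All class and method names are made up for illustration:

import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// Hypothetical sketch, not CarbonData's actual classes: just the two
// RowID representations discussed in the thread.
public final class DeleteDeltaSketch {

    // String form of the RowID described above:
    // <Segment ID>/<Block ID>/<Blocklet ID>/<Offset in Blocklet>
    static String encodeRowId(String segmentId, String blockId,
                              int blockletId, int offsetInBlocklet) {
        return segmentId + "/" + blockId + "/" + blockletId
            + "/" + offsetInBlocklet;
    }

    // More compact alternative: one Roaring bitmap of deleted row offsets
    // per blocklet, keyed by "<segment>/<block>/<blocklet>".
    private final Map<String, RoaringBitmap> deletedOffsets = new HashMap<>();

    void markDeleted(String blockletKey, int offsetInBlocklet) {
        deletedOffsets.computeIfAbsent(blockletKey, k -> new RoaringBitmap())
                      .add(offsetInBlocklet);
    }

    // Query-time check: a row is excluded if its offset is in the bitmap.
    boolean isDeleted(String blockletKey, int offsetInBlocklet) {
        RoaringBitmap bitmap = deletedOffsets.get(blockletKey);
        return bitmap != null && bitmap.contains(offsetInBlocklet);
    }
}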

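Similarly, the "error out on ambiguity" rule from Aniket's reply could be enforced with a simple duplicate check over the target RowIDs matched by the update's sub-select (again, purely illustrative names):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "error out" semantics from the thread: if
// the sub-select yields two source rows for the same target RowID, fail
// instead of silently applying the first or last match.
public final class UpdateAmbiguityCheck {

    static void validateSingleMatchPerTargetRow(List<String> matchedRowIds) {
        Set<String> seen = new HashSet<>();
        for (String rowId : matchedRowIds) {
            if (!seen.add(rowId)) {
                throw new IllegalStateException(
                    "Ambiguous update: target row " + rowId
                    + " matches multiple source rows; please refine the"
                    + " join condition.");
            }
        }
    }
}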