Hi Aniket,

I think if an update/delete touches only a small amount of data, then horizontal compaction can be triggered based on user configuration, but if a large amount of data is updated, it is better to start vertical compaction immediately. This is because we are not physically deleting the data from disk: if a large fraction of the data is updated (more than 60%), then during a query we first read the older data, exclude the deleted records, and then include the update delta file data. In that case more data will come into memory; we can avoid this by starting vertical compaction immediately after the update/delete.
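To make this concrete, here is a minimal sketch of the decision rule I have in mind (all names, including the 60% threshold constant, are illustrative and not from the actual CarbonData code base):

// Illustrative sketch only: encodes the decision rule described above.
public final class CompactionDecision {

    enum CompactionType { NONE, HORIZONTAL, VERTICAL }

    // Hypothetical threshold: start vertical compaction once more than
    // 60% of a segment's rows are covered by update/delete delta files.
    private static final double VERTICAL_THRESHOLD = 0.60;

    /**
     * Decides which compaction to run after an update/delete.
     *
     * @param updatedRows       rows covered by update/delete delta files
     * @param totalRows         total rows in the segment
     * @param horizontalEnabled user configuration flag for small updates
     */
    static CompactionType chooseCompaction(long updatedRows,
                                           long totalRows,
                                           boolean horizontalEnabled) {
        if (totalRows == 0) {
            return CompactionType.NONE;
        }
        double updateRatio = (double) updatedRows / totalRows;
        if (updateRatio > VERTICAL_THRESHOLD) {
            // Otherwise a query must read the base data, exclude the
            // delete delta, and include the update delta, pulling too
            // much data into memory.
            return CompactionType.VERTICAL;
        }
        return horizontalEnabled ? CompactionType.HORIZONTAL
                                 : CompactionType.NONE;
    }
}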
-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 2:43 PM, Kumar Vishal <[email protected]> wrote:

> Hi Aniket,
>
> I agree with Vimal's opinion, but that use case will be very rare.
>
> I have one query about this update and delete feature:
> when will we start compaction after each update or delete operation?
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik <[email protected]> wrote:
>
>> Hi Vimal,
>>
>> Thanks for your suggestions.
>> For the 1st point, I tend to agree with Manish's comments, but it's worth
>> looking into different ways to optimize the performance. I guess query
>> performance may take priority over update performance. Basically, we may
>> need a better compaction approach for merging delta files into regular
>> CarbonData files to maintain adequate performance.
>> For the 2nd point, CarbonData will support updating multiple rows, but
>> not updating the same row multiple times in a single update operation. It
>> is possible that the join condition in the sub-select of the original
>> update statement results in multiple rows from the source table for the
>> same row in the target table. This is an ambiguous condition, and the
>> common ways to solve it are to error out, to apply the first matching
>> row, or to apply the last matching row. CarbonData will choose to error
>> out and let the user resolve the ambiguity, which is the safer/standard
>> choice.
>>
>> Best Regards,
>> Aniket
>>
>> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <[email protected]> wrote:
>>
>>> Hi Vimal,
>>>
>>> I have a few queries regarding the 1st suggestion.
>>>
>>> 1. Dimensions can be both dictionary and no-dictionary. If we update the
>>> dictionary file, then we will have to maintain two flows, one for
>>> dictionary columns and one for no-dictionary columns. Will that be OK?
>>>
>>> 2. We write dictionary files in append mode. Updating dictionary files
>>> would mean completely rewriting the dictionary file, which would also
>>> modify the dictionary metadata and sort index files, OR is there some
>>> other approach that needs to be followed, such as maintaining an update
>>> delta mapping for the dictionary file?
>>>
>>> Regards
>>> Manish Gupta
>>>
>>> On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath
>>> <[email protected]> wrote:
>>>
>>>> Hi Aniket,
>>>>
>>>> The design looks sound and the documentation is great.
>>>> I have a few suggestions.
>>>>
>>>> 1) Measure update vs. dimension update: In the case of a dimension
>>>> update, for example, the user wants to change dept1 to dept2 for all
>>>> users who are under dept1. Can we just update the dictionary for
>>>> faster performance?
>>>> 2) Update semantics (one matching record vs. multiple matching
>>>> records): I could not understand this section. I wanted to confirm
>>>> whether we will support one update statement updating multiple rows.
>>>>
>>>> -Vimal
>>>>
>>>> On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <[email protected]> wrote:
>>>>
>>>>> Hi Aniket
>>>>>
>>>>> Thank you for finishing the good design documents. A couple of inputs
>>>>> from my side:
>>>>>
>>>>> 1. Please add the below-mentioned info (RowID definition etc.) to the
>>>>> design documents as well.
>>>>> 2. On page 6: "Schema change operation can run in parallel with
>>>>> Update or Delete operations, but not with another schema change
>>>>> operation". Can you explain this item?
>>>>> 3. Please unify the description: use "CarbonData" to replace
>>>>> "Carbon", and unify the description for "destination table" and
>>>>> "target table".
>>>>> 4. Is the Update operation's delete delta the same as the Delete
>>>>> operation's delete delta?
>>>>>
>>>>> BTW, it would be much better if you could provide Google Docs for
>>>>> review next time; it is really difficult to comment on PDF
>>>>> documents :)
>>>>>
>>>>> Regards
>>>>> Liang
>>>>>
>>>>> Aniket Adnaik wrote:
>>>>>
>>>>>> Hi Sujith,
>>>>>>
>>>>>> Please see my comments inline.
>>>>>>
>>>>>> Best Regards,
>>>>>> Aniket
>>>>>>
>>>>>> On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko
>>>>>> <sujithchacko.2010@> wrote:
>>>>>>
>>>>>>> Hi Aniket,
>>>>>>>
>>>>>>> It's a well-documented design. I just want to know a few points:
>>>>>>>
>>>>>>> a. Format of the RowID and its data type.
>>>>>>
>>>>>> AA>> The following format can be used to represent a unique RowID:
>>>>>>
>>>>>> [
>>>>>> <Segment ID>
>>>>>> <Block ID>
>>>>>> <Blocklet ID>
>>>>>> <Offset in Blocklet>
>>>>>> ]
>>>>>>
>>>>>> A simple way would be to use the String data type and store it as a
>>>>>> text file. However, a more efficient way could be to use
>>>>>> BitSets/bitmaps as a further optimization. Compressed bitmaps such
>>>>>> as Roaring bitmaps can be used for better performance and more
>>>>>> efficient storage.
>>>>>>
>>>>>>> b. Impact of this feature on select queries: since the query
>>>>>>> process has to exclude each deleted record and include the
>>>>>>> corresponding updated record every time, is any optimization being
>>>>>>> considered to tackle the query performance impact, given that one
>>>>>>> of the major highlights of CarbonData is performance?
>>>>>>
>>>>>> AA>> Some of the optimizations would be to cache the deltas to
>>>>>> avoid recurrent I/O, to store sorted RowIDs in the delete delta for
>>>>>> efficient lookup, and to perform regular compaction to minimize the
>>>>>> impact on select query performance. Additionally, we may have to
>>>>>> explore ways to perform compaction automatically, for example, when
>>>>>> more than 25% of rows are read from deltas. Please feel free to
>>>>>> share if you have any ideas or suggestions.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Sujith
>>>>>>>
>>>>>>> On Nov 20, 2016 9:24 PM, "Aniket Adnaik" <aniket.adnaik@> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Please find a design doc for Update/Delete support in CarbonData:
>>>>>>>>
>>>>>>>> https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?usp=sharing
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Aniket
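PS: To make the RowID discussion above concrete, here is a rough sketch of the two representations Aniket mentioned, a string-encoded RowID and a compressed bitmap of deleted offsets per blocklet (using the open-source org.roaringbitmap library). All class and method names are made up for illustration:

import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// Hypothetical sketch, not CarbonData's actual classes: just the two
// RowID representations discussed in the thread.
public final class DeleteDeltaSketch {

    // String form of the RowID described above:
    // <Segment ID>/<Block ID>/<Blocklet ID>/<Offset in Blocklet>
    static String encodeRowId(String segmentId, String blockId,
                              int blockletId, int offsetInBlocklet) {
        return segmentId + "/" + blockId + "/" + blockletId
            + "/" + offsetInBlocklet;
    }

    // More compact alternative: one Roaring bitmap of deleted row offsets
    // per blocklet, keyed by "<segment>/<block>/<blocklet>".
    private final Map<String, RoaringBitmap> deletedOffsets = new HashMap<>();

    void markDeleted(String blockletKey, int offsetInBlocklet) {
        deletedOffsets.computeIfAbsent(blockletKey, k -> new RoaringBitmap())
                      .add(offsetInBlocklet);
    }

    // Query-time check: a row is excluded if its offset is in the bitmap.
    boolean isDeleted(String blockletKey, int offsetInBlocklet) {
        RoaringBitmap bitmap = deletedOffsets.get(blockletKey);
        return bitmap != null && bitmap.contains(offsetInBlocklet);
    }
}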

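Similarly, the "error out on ambiguity" rule from Aniket's reply could be enforced with a simple duplicate check over the target RowIDs matched by the update's sub-select (again, purely illustrative names):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "error out" semantics from the thread: if
// the sub-select yields two source rows for the same target RowID, fail
// instead of silently applying the first or last match.
public final class UpdateAmbiguityCheck {

    static void validateSingleMatchPerTargetRow(List<String> matchedRowIds) {
        Set<String> seen = new HashSet<>();
        for (String rowId : matchedRowIds) {
            if (!seen.add(rowId)) {
                throw new IllegalStateException(
                    "Ambiguous update: target row " + rowId
                    + " matches multiple source rows; please refine the"
                    + " join condition.");
            }
        }
    }
}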