@ajantha Even for carbon-to-carbon inserts, the scenarios I mentioned may still apply. As I said above, even if the schemas are identical in every respect, how are you going to handle a difference in column properties?
If the destination table needs the bad record feature enabled, I feel you should perform it. Or, for better performance, you could recommend that the user explicitly disable unwanted steps such as the bad record feature if he/she doesn't care about bad records while inserting (a configuration sketch follows after the quoted thread below). Hope you got my point. Implicitly deciding whether bad record handling is required, by assuming the source and destination tables possess exactly the same schema under all conditions, will be a risk. I think you should relook at this part. Could you share a design document in JIRA or by mail?

On Thu, Jan 2, 2020 at 7:24 AM Ajantha Bhat <[email protected]> wrote:

> Hi Sujith,
>
> I still keep the converter step for some scenarios, like insert from parquet to
> carbon: we need an optimized converter here to convert the timestamp long
> value (divide by 1000) and convert null values of direct dictionary columns to 1.
> So, for the scenarios you mentioned, I will be using this flow with the
> optimized converter.
>
> For carbon-to-carbon insert with the same source and destination
> properties (a common scenario in cloud migration), it goes to the no-converter
> step and uses the Spark internal row directly until the write step.
> Compaction can also use this no-converter step.
>
> Thanks,
> Ajantha
>
> On Thu, 2 Jan, 2020, 12:18 am sujith chacko, <[email protected]>
> wrote:
>
> > Hi Ajantha,
> >
> > Thanks for your initiative. I have a couple of questions, though.
> >
> > a) As per your explanation, the dataset validation is already done as part
> > of the source table; is this what you mean? What I understand is that
> > insert-select queries are going to get some benefit since we skip some
> > additional steps.
> >
> > What if your destination table has some different table properties:
> > a few columns may have non-null constraints, or the date format, decimal
> > precision, or scale may be different.
> > You may then need bad record support, so how are you going to handle
> > such scenarios? Correct me if I misinterpreted your points.
> >
> > Regards,
> > Sujith
> >
> >
> > On Fri, 20 Dec 2019 at 5:25 AM, Ajantha Bhat <[email protected]>
> > wrote:
> >
> > > Currently carbondata "insert into" uses the CarbonLoadDataCommand itself.
> > > The load process has steps like parsing and a converter step with bad record
> > > support.
> > > Insert into doesn't require these steps, as the data is already validated and
> > > converted from the source table or dataframe.
> > >
> > > Some identified changes are below.
> > >
> > > 1. Need to refactor and separate load and insert at the driver side, to skip
> > > the converter step and unify the flow for no-sort and global-sort insert.
> > > 2. Need to avoid reordering each row, by changing the select dataframe's
> > > projection order itself during the insert into.
> > > 3. For carbon-to-carbon insert, need to provide the ReadSupport and use the
> > > RecordReader (the vector reader currently doesn't support ReadSupport) to
> > > handle null values and the timestamp cutoff (direct dictionary) from the
> > > scanRDD result.
> > > 4. Need to handle insert into partition/non-partition tables in the local sort,
> > > global sort, no sort, range columns, and compaction flows.
> > >
> > > The final goal is to improve insert performance by keeping only the required
> > > logic and also decreasing the memory footprint.
> > >
> > > If you have any other suggestions or optimizations related to this, let me
> > > know.
> > >
> > > Thanks,
> > > Ajantha
> > >
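On the point about letting the user explicitly control bad record handling: a minimal sketch of what that could look like, assuming `spark` is a SparkSession with the CarbonData extensions and that insert still goes through the load flow (as it does today via CarbonLoadDataCommand). To my knowledge `carbon.options.bad.records.action` is one of CarbonData's dynamically settable load options, but please verify the property name and accepted values against the version in use.

// Hedged sketch: explicitly choosing the bad record behaviour before an
// insert-select, instead of the engine deciding implicitly.
// Assumes a SparkSession `spark` with CarbonData extensions; property name
// and values should be verified against the CarbonData docs for your version.
spark.sql("SET carbon.options.bad.records.action=FAIL") // or FORCE / IGNORE / REDIRECT
spark.sql("INSERT INTO target_table SELECT * FROM source_table")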
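For reference, a rough sketch of the per-value conversions Ajantha describes for the optimized converter on the parquet-to-carbon path (timestamp long divided by 1000, null direct-dictionary values mapped to 1). The object and method names here are mine for illustration, not actual CarbonData classes.

// Spark keeps timestamps as microseconds since epoch, while the direct
// dictionary works on milliseconds, and a null direct-dictionary value is
// represented by the surrogate 1 (per the mail above). Names are hypothetical.
object OptimizedConverterSketch {
  val DirectDictionaryNullSurrogate: Long = 1L

  /** Convert a Spark timestamp value (micros, possibly null) to the carbon direct-dictionary value. */
  def convertTimestamp(sparkMicros: java.lang.Long): Long =
    if (sparkMicros == null) DirectDictionaryNullSurrogate
    else sparkMicros / 1000 // micros -> millis ("divide by 1000")
}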
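And for item (2) in Ajantha's original list, avoiding the per-row reorder by fixing the projection order of the select dataframe, a rough illustration of the idea in plain Spark terms; the function and parameter names are hypothetical, not the actual implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Instead of reordering every row at write time, rearrange the source
// dataframe's projection once so rows already arrive in the destination
// table's column order.
def alignProjection(source: DataFrame, targetColumnOrder: Seq[String]): DataFrame =
  source.select(targetColumnOrder.map(col): _*)

// e.g. alignProjection(spark.table("source_table"), Seq("id", "name", "ts"))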
