Re: [VOTE] Apache CarbonData 2.2.0(RC2) release

2021-08-02 Thread Ajantha Bhat
+1

Regards,
Ajantha

On Mon, Aug 2, 2021 at 9:03 PM Venkata Gollamudi 
wrote:

> +1
>
> Regards,
> Venkata Ramana
>
> On Mon, 2 Aug, 2021, 20:18 Kunal Kapoor,  wrote:
>
> > +1
> >
> > Regards
> > Kunal Kapoor
> >
> > On Mon, 2 Aug 2021, 4:53 pm Kumar Vishal, 
> > wrote:
> >
> > > +1
> > > Regards
> > > Kumar Vishal
> > >
> > > On Mon, 2 Aug 2021 at 2:28 PM, Indhumathi M 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Regards,
> > > > Indhumathi M
> > > >
> > > > On Mon, Aug 2, 2021 at 12:33 PM Akash Nilugal <
> akashnilu...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I submit the Apache CarbonData 2.2.0(RC2) for your vote.
> > > > >
> > > > >
> > > > > *1.Release Notes:*
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12347869=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_d44fca7058ab2c2a2a4a24e02264cc701f7d10b8_lin
> > > > >
> > > > >
> > > > > *Some key features and improvements in this release:*
> > > > >- Integrate with Apache Spark-3.1
> > > > >- Leverage Secondary Index till segment level with SI as datamap
> > and
> > > > SI
> > > > > with plan rewrite
> > > > >- Make Secondary Index as a coarse grain datamap and use
> secondary
> > > > > indexes for Presto queries
> > > > >- Support rename SI table
> > > > >- Support describe column
> > > > >- Local sort Partition Load and Compaction improvement
> > > > >- GeoSpatial Query Enhancements
> > > > >- Improve the table status and segment file writing
> > > > >- Improve the carbon CDC performance and introduce APIs to
> > > > > UPSERT, DELETE, UPDATE and INSERT
> > > > >- Improve clean file and rename performance
> > > > >
> > > > > *2. The tag to be voted upon:* apache-carbondata-2.2.0-rc2
> > > > >
> > https://github.com/apache/carbondata/tree/apache-carbondata-2.2.0-rc2
> > > > >
> > > > > Commit: c3a908b51b2f590eb76eb4f4d875cd568dbece40
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/carbondata/commit/c3a908b51b2f590eb76eb4f4d875cd568dbece40
> > > > >
> > > > >
> > > > > *3. The artifacts to be voted on are located here:*
> > > > > https://dist.apache.org/repos/dist/dev/carbondata/2.2.0-rc2
> > > > >
> > > > > *4. A staged Maven repository is available for review at:*
> > > > >
> > > > >
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1071/
> > > > >
> > > > >
> > > > > Please vote on releasing this package as Apache CarbonData 2.2.0,
> > The
> > > > vote
> > > > > will be open for the next 72 hours and passes if a majority of at
> > least
> > > > > three +1
> > > > > PMC votes are cast.
> > > > >
> > > > > [ ] +1 Release this package as Apache CarbonData 2.2.0
> > > > >
> > > > > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> > > > >
> > > > > [ ] -1 Do not release this package because...
> > > > >
> > > > >
> > > > > Regards,
> > > > > Akash R Nilugal
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSSION] Support JOIN query with spatial index

2021-04-27 Thread Ajantha Bhat
ok.
+1 from my side.

If polygon join query still has performance bottleneck, we can later
optimize it.

Thanks,
Ajantha

On Tue, Apr 27, 2021 at 3:59 PM Indhumathi  wrote:

> Thanks Ajantha for your inputs.
>
> I have modified the design by adding a ToRangeList UDF filter as an implicit
> column projection to the polygon table dataframe and modified the JOIN
> condition with the range-list UDF column, in order to improve performance.
>
> This way, we can reduce building the quadtree from N*M times to M times.
> I have attached new design document in the JIRA.
> CARBONDATA-4166 
>
> Regards,
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[Design Discussion] Transaction manager, time travel and segment interface refactoring

2021-04-22 Thread Ajantha Bhat
Hi All,
In this thread, I am continuing the below discussion along with the
Transaction Manager and Time Travel feature design.
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Taking-the-inputs-for-Segment-Interface-Refactoring-td101950.html

The goal of this requirement is as follows.

   1. Implement a “Transaction Manager” with optimistic concurrency to provide
      within-a-table transactions/versioning. (Interfaces should also be
      flexible enough to support across-table transactions.)
   2. Support time travel in CarbonData.
   3. Decouple and clean up segment interfaces. (This should also help in
      supporting segment concepts for other open formats under the CarbonData
      metadata service.)


The design document is attached in JIRA.
JIRA link: https://issues.apache.org/jira/browse/CARBONDATA-4171
GoogleDrive link:
https://docs.google.com/document/d/1FsVsXjj5QCuFDrzrayN4Qo0LqWc0Kcijc_jL7pCzfXo/edit?usp=sharing

Please have a look; suggestions are welcome.
I have mentioned some TODOs in the document and will update them in the V2
version soon.
Implementation will be done by adding subtasks under the same JIRA.

Thanks,
Ajantha


Re: [DISCUSSION] Support JOIN query with spatial index

2021-04-19 Thread Ajantha Bhat
Hi,
I think now the latest document has addressed my previous comments and
questions.

polygon list query and polyline list query design looks ok.

But I have a performance concern about the design of the polygon query with join.
In this approach, we are using a union polygon filter on spatial_table to
prune down to the blocklet level.
In the worst case it may identify all the rows in the blocklet, and with this
output (N) we will perform a join with the polygon table output (M),
which will again check the IN_POLYGON condition (N*M) times during the join. I
don't have a different solution at the moment either.

But we can optimize the current solution further with the points below:
a) For the polygon table output, you can reduce building the quadtree from N*M
times to M times and use the quadtree output as a range filter/UDF for the join.
b) Later, if we need more improvement, maybe we can try row-level
filtering on the spatial table.

Thanks,
Ajantha
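
A rough sketch of the join shape under discussion, for readers following the
thread (table and column names are invented, and the IN_POLYGON_JOIN signature
is only an assumption based on the proposal, not final syntax):

  -- spatial_table holds points indexed by a spatial (geohash) column;
  -- polygon_table holds one polygon (a series of longitude/latitude pairs) per row,
  -- so the polygons no longer have to be typed into the query by hand.
  SELECT s.id, s.longitude, s.latitude, p.poly_id
  FROM spatial_table s
  JOIN polygon_table p
    ON IN_POLYGON_JOIN(s.mygeohash, p.polygon);

The N*M concern above comes from re-evaluating the polygon check for every
candidate row of spatial_table (N) against every polygon row (M) during the join.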



On Thu, Apr 15, 2021 at 9:37 PM Indhumathi  wrote:

> Hello all,
>
> Please find the design document link attached in JIRA,  CARBONDATA-4166
> 
> Any inputs/suggestions from the community is most welcomed.
>
> Regards,
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Improve carbondata CDC performance

2021-03-30 Thread Ajantha Bhat
+1 for this improvement,

But this optimization is dependent on the data: there may be a scenario
where, even after you prune with min/max, your dataset size remains almost the
same as the original,
which brings in the extra overhead of the newly added operations.
Do you have a plan to add some intelligence, threshold, or fallback
mechanism for that case?

Thanks,
Ajantha

On Mon, Mar 29, 2021 at 5:59 PM Indhumathi  wrote:

> +1
>
> Regards,
> Indhumathi M
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Support SI at Segment level

2021-03-30 Thread Ajantha Bhat
+1 for this proposal.

But the other ongoing requirement (
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-td105291.html)
is dependent on *isSITableEnabled*,
so it is better to wait for it to finish and redesign on top of it.

Thanks,
Ajantha

On Tue, Mar 23, 2021 at 1:03 PM Mahesh Raju Somalaraju <
maheshraju.o...@gmail.com> wrote:

> Hi,
>
> +1 for the feature.
> It will make the query faster.
>
> 1) The design discussion about the feature (SI to prune as a data frame)
> mentions one property to set:
>   if the engine wants to use SI as a datamap then it needs to be set; if not
> set, then the plan-rewrite flow will be used.
>
>   So we have to handle this feature in two cases. Can you please check and
> update the design as per this?
>
> References:
> SI to prune as a data frame
>
> https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing
>
> Thanks & Regards
> Mahesh Raju Somalaraju
>
> On Wed, Feb 17, 2021 at 4:05 PM Nihal  wrote:
>
> > Hi all,
> >
> > Currently, if the parent(main) table and SI table don’t have the same
> valid
> > segments then we disable the SI table. And then from the next query
> > onwards,
> > we scan and prune only the parent table until we trigger the next load or
> > REINDEX command (as these commands will make the parent and SI table
> > segments in sync). Because of this, queries take more time to give the
> > result when SI is disabled.
> >
> > To solve this problem we are planning to support SI at the segment level.
> > It
> > means we will not disable SI if the parent and SI table don’t have the
> same
> > segments; instead, we will do the pruning on SI for all valid segments, and
> > for
> > the rest of the segments, we will do the pruning on main/parent table.
> >
> >
> > At the time of pruning with the main table in TableIndex.prune, if SI
> > exists
> > for the corresponding filter then all segments which are not present in
> the
> > SI table will be pruned on the corresponding parent table segment.
> >
> > Please let me know your thought and input about the same.
> >
> > Regards
> > Nihal kumar ojha
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
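
For reference, a minimal sketch of the SI commands this thread refers to
(index and table names are placeholders; the REINDEX form follows the syntax
described in the secondary index guide and may differ in detail):

  -- Create a secondary index on a main table column:
  CREATE INDEX idx_name ON TABLE maintable(name) AS 'carbondata';

  -- Repair SI segments that are missing or out of sync with the main table,
  -- which is what currently re-enables a disabled SI:
  REINDEX INDEX TABLE idx_name ON maintable;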


Re: [DISCUSSION] Describe complex columns

2021-03-30 Thread Ajantha Bhat
Hi,

+1 for this improvement.

a) You can also print one line of short information about the parent column
when describe column is executed,
to avoid executing it again just to know the parent column's type.
Example:
 Describe column decimalcolumn on complexcarbontable;
*You can mention that decimalcolumn is a MAP<> type and that its children are as
follows.*

b) Are you blocking describe column on primitive types, or just printing short
information about the primitive data type?
I think the latter is fine.

Thanks,
Ajantha
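
A minimal sketch of the DDL being discussed, using the example column from
point a) (table and column names are placeholders, and the printed output shape
is only an assumption, not a final format):

  -- Describe a single (possibly nested) column of a table:
  DESCRIBE COLUMN decimalcolumn ON complexcarbontable;
  -- Expected to print a short line such as "decimalcolumn : MAP<STRING, DECIMAL(10,2)>"
  -- followed by its child fields, so a second command is not needed to learn the parent type.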


On Mon, Mar 22, 2021 at 9:37 PM akashrn5  wrote:

> Hi,
>
> +1 for the new functionality.
>
> my suggestion is to modify the DDL something like below
>
> DESCRIBE column fieldname ON [db_name.]table_name;
> DESCRIBE table short/transient [db_name.]table_name;
>
> Others can give their suggestions
>
> Thanks,
>
> Regards,
> Akash R
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Support JOIN query with spatial index

2021-03-30 Thread Ajantha Bhat
Hi, I have some doubts and suggestions for the same.

Currently, we support these UDFs --> IN_POLYGON, IN_POLYGON_LIST,
IN_POLYLINE_LIST, IN_POLYGON_RANGE_LIST,
but the user needs to give the polygon input manually, and as a polygon can have
many points, it is hard to give manually.
So, your requirement is to add a new UDF, IN_POLYGON_JOIN, where the polygon
inputs are present in another table and you want to join it with the main
table.

*Please find my doubts below:*
a. Why do a join, when you can form a subquery to fetch the polygons from
table2 and give them as input to the IN_POLYGON UDF and the other existing UDFs?
b. Don't we need to support the same for the IN_POLYLINE_LIST
and IN_POLYGON_RANGE_LIST UDFs as well?

*Suggestions:*
a. The table names and queries do not match; please update them.
b. The query doesn't look like the union query explained in the diagram;
please update and explain it.
c. Please include some sample data with examples for t1 and t2, and also
provide the expected query result.
d. Also mention how to select data for a single polygon and for multiple polygons
from the tables.

Thanks,
Ajantha
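
To make doubt a) concrete, a minimal sketch of the existing manual usage that
the join is meant to replace (coordinates and names are sample values only):

  -- Today the polygon boundary is passed as a literal string of
  -- "longitude latitude" pairs, which is painful to type for large polygons:
  SELECT longitude, latitude
  FROM spatial_table
  WHERE IN_POLYGON('116.321011 40.123503, 116.137676 39.947911,
                    116.560993 39.935276, 116.321011 40.123503');

The question in a) is whether a subquery on the polygon table could feed this
same UDF instead of introducing a dedicated join UDF.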

On Tue, Mar 30, 2021 at 11:25 AM Kunal Kapoor 
wrote:

> +1
>
> On Mon, Mar 22, 2021 at 4:07 PM Indhumathi 
> wrote:
>
> > Hi community,
> >
> > Currently, carbon supports IN_POLYGON and IN_POLYGON_LIST udf's,
> > where user has to manually provide the polygon points(series of latitude
> > and longitude pair), to query carbon table based on spatial index.
> >
> > This feature will support JOIN tables based on IN_POLYGON udf
> > filter, where polygon data exists in a table.
> >
> > Please find below link of design doc. Please check and give
> > your inputs/suggestions.
> >
> >
> >
> https://docs.google.com/document/d/11PnotaAiEJQK_QvKsHznDy1I9tO4idflW32LstwcLhc/edit#heading=h.yh6qp815dh3p
> >
> >
> > Thanks & Regards,
> > Indhumathi M
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [DISCUSSION] Support alter schema for complex types

2021-03-30 Thread Ajantha Bhat
Hi Akshay,
The mail description and the document content do not match: for a
single-level struct also, the document says it cannot be supported.
So, please list all the work that needs to be done as points, and
then clearly divide, in the summary section of the document, what is supported
in phase 1 and what is supported in phase 2.

Also, in the query flow, after adding the column, what will be the output for
previously loaded segments: NULL or an empty complex type?
You can refer to Hive behavior for this. I hope schema evolution (column drift)
also remains intact with complex column support.

Thanks,
Ajantha
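
A minimal sketch of the kind of DDL phase 1 would have to cover, assuming the
existing ADD COLUMNS syntax is simply extended to complex types (table, column
and field names are placeholders):

  -- Add a new single-level struct column to an existing table:
  ALTER TABLE complextable ADD COLUMNS(info STRUCT<id:INT, name:STRING>);
  -- Open question from above: for segments loaded before this DDL,
  -- does a query on "info" return NULL or an empty struct?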

On Tue, Mar 30, 2021 at 11:18 AM Kunal Kapoor 
wrote:

> +1
>
> On Fri, Mar 26, 2021 at 6:19 PM akshay_nuthala 
> wrote:
>
> > No, these and other nested level operations will be taken care in the
> next
> > phase.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [VOTE] Apache CarbonData 2.1.1(RC2) release

2021-03-29 Thread Ajantha Bhat
Hi all,

PMC vote has passed for Apache Carbondata 2.1.1 release, the result is as
below:

+1(binding): 5(Kunal Kapoor, David CaiQiang, Kumar Vishal, Ravindra Pesala,
Liang Chen)


+1(non-binding) : 2 (Akash, Indhumathi)


Thanks all for your vote.

On Mon, Mar 29, 2021 at 12:57 PM Liang Chen  wrote:

> +1(binding)
>
> Regards
> Liang
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion]Presto Queries leveraging Secondary Index

2021-03-29 Thread Ajantha Bhat
+1

Thanks,
Ajantha

On Mon, Mar 29, 2021 at 5:58 PM Indhumathi  wrote:

> +1 for design.
>
> Please find my comments.
>
> 1. About updating IndexStatus.ENABLED property, Need to consider
> compatibility scenarios as well.
> 2. Can update the query behavior when carbon.enable.distributed.index
> and carbon.disable.index.server.fallback is enabled.
>
>
> Regards,
> Indhumathi M
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[VOTE] Apache CarbonData 2.1.1(RC2) release

2021-03-26 Thread Ajantha Bhat
Hi All,

I submit the Apache CarbonData 2.1.1(RC2) for your vote.

*1.Release Notes:*
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349409=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_bb629ffd13f06db9dafa005fe7b737939b88ba5d_lin

*Some key features and improvements in this release:*

   - Geospatial index algorithm improvement and UDFs enhancement
   - Adding global sort support for SI segments data files merge operation.
   - Refactor CarbonDataSourceScan without Spark Filter
   - Size control of minor compaction
   - Clean files become data trash manager
   - Fix error when loading string field with high cardinality (local
dictionary fallback issue)


 *2. The tag to be voted upon* : apache-carbondata-2.1.1-rc2
<https://github.com/apache/carbondata/tree/apache-carbondata-2.1.1-rc2>

Commit:
770ea3967c81abcd61c28c4d9bb557da9ceb4322
<
https://github.com/apache/carbondata/commit/770ea3967c81abcd61c28c4d9bb557da9ceb4322
>

*3. The artifacts to be voted on are located here:*
https://dist.apache.org/repos/dist/dev/carbondata/2.1.1-rc2/

*4. A staged Maven repository is available for review at:*
https://repository.apache.org/content/repositories/orgapachecarbondata-1068/

*5. Release artifacts are signed with the following key:*
https://people.apache.org/keys/committer/ajantha.asc


Please vote on releasing this package as Apache CarbonData 2.1.1,
The vote will be open for the next 72 hours and passes if a majority of at
least
three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 2.1.1

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ajantha Bhat


Re: DISCUSSION: propose to activate "Issues" of https://github.com/apache/carbondata

2021-03-18 Thread Ajantha Bhat
Hi,

After opening github issues tab, are we going to stop using JIRA?
If we keep both, then when to use JIRA and when to use issues?

Also, as we have a Slack channel now, if users face issues they can directly
discuss them in Slack for quick support.

Thanks,
Ajantha

On Thu, 18 Mar, 2021, 5:29 pm Liang Chen,  wrote:

> Hi
>
> As you know, for better managing the community, I propose to put "Issues, Pull
> Requests, Code" together and request Apache INFRA to activate "Issues" on
> GitHub.
>
> Open discussion, please input your comments.
>
> Regards
> Liang
>


[VOTE] Apache CarbonData 2.1.1(RC1) release

2021-03-17 Thread Ajantha Bhat
Hi All,

I submit the Apache CarbonData 2.1.1(RC1) for your vote.

*1.Release Notes:*
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349409=Html=12320220=Create_token=A5KQ-2QAV-T4JA-FDED_5eb1fdbbf3210d3051e01492f8a498aaeeed8c28_lout

*Some key features and improvements in this release:*

   - Geo spatial index algorithm improvement and UDFs enhancement
   - Adding global sort support for SI segments data files merge operation.
   - Refactor CarbonDataSourceScan without Spark Filter
   - Size control of minor compaction
   - Clean files become data trash manager
   - Fix error when loading string field with high cardinality (local
dictionary fallback issue)


 *2. The tag to be voted upon* : apache-carbondata-2.1.1-rc1
<https://github.com/apache/carbondata/tree/apache-carbondata-2.1.1-rc1>

Commit:
8acab9ae7287c527fb9d7e103f91f0c2f7d02f81
<
https://github.com/apache/carbondata/commit/8acab9ae7287c527fb9d7e103f91f0c2f7d02f81
>

*3. The artifacts to be voted on are located here:*
https://dist.apache.org/repos/dist/dev/carbondata/2.1.1-rc1/

*4. A staged Maven repository is available for review at:*
https://repository.apache.org/content/repositories/orgapachecarbondata-1066/

*5. Release artifacts are signed with the following key:*
https://people.apache.org/keys/committer/ajantha.asc


Please vote on releasing this package as Apache CarbonData 2.1.1,
The vote will be open for the next 72 hours and passes if a majority of at
least
three +1 PMC votes are cast.

[ ] +1 Release this package as Apache CarbonData 2.1.1

[ ] 0 I don't feel strongly about it, but I'm okay with the release

[ ] -1 Do not release this package because...


Regards,
Ajantha Bhat


Re: [DISCUSSION] Improve Simple insert performance in carbondata

2021-02-02 Thread Ajantha Bhat
Hi,

By simple insert, do you mean "insert by values"? I don't think this will be
used frequently in a real data pipeline. Ideally, insert will be used for
inserting from another table or an external table.

Just for a one-row insert (or insert by values), I don't think we need to
avoid the Spark RDD flow. Also, based on your design, using the SDK to write a
transactional table segment brings the extra overhead of creating the metadata
files manually. Considering the changes and their value addition in a real-time
scenario,

-1 from my side for this requirement.

Thanks,
Ajantha
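
To illustrate the distinction being drawn (table names are made up):

  -- "Insert by values": a handful of literal rows, no source table involved.
  INSERT INTO sales VALUES (1, 'shirt', 19.5);

  -- Typical pipeline insert: reads from another or an external table, where the
  -- Spark RDD flow and its serialization/shuffle overhead are justified.
  INSERT INTO sales SELECT * FROM staging_sales;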

On Tue, 2 Feb, 2021, 6:51 pm akshay_nuthala, 
wrote:

> Hi Community,
>
> As Carbon is closely integrated with spark, insert operations in carbon are
> done using spark API. This in turn fires spark jobs, which adds various
> overhead like task serialisation cost, extra memory consumption, execution
> time in remote nodes, shuffle etc.
>
> In case of simple insert operations - we can improve the performance by
> reusing SDK (which is plain java code) to achieve the same, thereby cutting
> off the overheads discussed above.
>
> Following is the link to the design document. Please give your valuable
> comments/inputs/suggestions.
>
>
> https://docs.google.com/document/d/1BcbTcO__vZbLLuhU73NIcbJOM2FRcKBa-ZxackofAS0/edit?usp=sharing
>
> Thanks,
>
> Regards,
> N Akshay Kumar
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-01-17 Thread Ajantha Bhat
Hi Nihal,
In a concurrent scenario we cannot map which load command was loaded as
which segment id,
so it is good to show the summary at the end of the command.


I agree with David's suggestion.
Along with load and insert, if possible we should give a summary for update,
delete and merge as well (for which we may start supporting concurrent operations
in the near future).


Thanks,
Ajantha
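
For context, a sketch of how the loaded segment id is found today, since the
load command itself does not report it (paths and table name are placeholders):

  LOAD DATA INPATH 'hdfs://path/to/data.csv' INTO TABLE sales;
  -- The segment id must then be looked up separately:
  SHOW SEGMENTS FOR TABLE sales;

The proposal above is to print such a summary (segment id per load/insert, and
possibly per update/delete/merge) directly in the command's result.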

On Mon, 18 Jan, 2021, 9:49 am akashrn5,  wrote:

> Hi Nihal,
>
> The problem statement is not so clear; basically, what is the use case, or
> in
> which scenario is the problem faced? Because we need to get the result
> from
> the success segments itself. So please elaborate a little bit about the
> problem.
>
> Also, if you want to include more details, do not include in default show
> segments, may be can include in show segments with query, which likun had
> implemented. But this we can decide once its clear.
>
> Also, @vikram showing cache here is not a good idea, as we already have a
> command for that. If you are planning for segments wise, we can improve the
> existing cache specific commands, lets not include here.
>
> Thanks,
>
> Regards,
> Akash
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion]Presto Queries leveraging Secondary Index

2021-01-05 Thread Ajantha Bhat
Hi Venu,

a. *Presto carbondata supports reading the bloom index*, so I want to correct
your initial statement "Presto engine do not make use of
indexes(SI, Bloom etc) in query processing".

b. Between option1 and option2, the main difference is that *option1 is
multi-threaded and option2 is distributed.*
The performance of option1 will be bad. Hence, even though we need a spark
index server cluster (currently presto carbondata always needs a spark cluster
to write carbondata), *I want to go with option2.*

c. For option2, the implementation cannot be done like bloom, as we need to
read the whole SI table with a filter. So I suggest making a dataframe by
querying the SI table (which calls CarbonScanRDD), and once you get the
matched blocklets, creating the splits for the main table from that, based on
block-level or blocklet-level task distribution.

Thanks,
Ajantha

On Tue, Jan 5, 2021 at 5:31 PM VenuReddy  wrote:

> Hi all.!
>
> At present Carbon table queries with Presto engine do not make use of
> indexes(SI, Bloom etc) in query processing. Exploring feasible approaches
> without query plan rewrite to make use of secondary indexes(if any
> available) similar to that of existing datamap.
>
> *Option 1:*
> Presto gets splits for the main table to find the suitable SI table, scans it, gets
> the position references from the SI table, and returns the splits for the main table
> accordingly.
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. Within context of CarbondataSplitManager.getSplits() for main table, in
> CarbonInputFormat.getPrunedBlocklets(), we can do prune with new
> CoarseGrainIndex implementation for SI(similar to that of bloom). Inside
> Prune(), Identify the best suitable SI table, Use SDK CarbonReader to scan
> the identified SI table, get the position references to matching predicate.
> Need to think of reading the table in multiple threads.
> 3. Modify the filter expression to append positionId filter with obtained
> position references from SI table read.
> 4. In the context of CarbondataPageSource, create QueryModel with modified
> filter expression.
> Rest of the processing remains same as before.
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> 2. Can leverage SI with any execution engine.
> *DisAdvantages:*
> 1. Reading SI table in the context of CarbondataSplitManager.getSplits() of
> main table, possibly may degrade the query performance. Need to have enough
> resource to spawn multiple threads for reading within it.
>
> *Option 2:*
> Use Index Server to prune (enable distributed pruning).
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. On Index Server, during getSplits() for main table, in the context of
> DistributedPruneRDD.internalCompute()(i.e., on Index server executors)
> within pruneIndexes() can do prune with new CoarseGrainIndex implementation
> for SI(similar to that of bloom). Inside Prune(), Identify the best
> suitable
> SI table, Use CarbonReader to read the SI table, get the position
> references
> to matching predicate.
> 3. Return the extended blocklets for main table
> 4. Need to check how to return/transform filter expression to append
> positionId filter with position references which are read from SI table
> from
> Index Server to Driver along with pruned blocklets??
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> *DisAdvantages:*
> 1. Index Server Executors memory would be occupied for SI table reading.
> 2. Concurrent queries may have impact as Index server is used for SI table
> reading.
> 3. Index Server must be running.
>
> We can introduce a new Carbon property to switch between present and the
> new
> approach being proposed. We may consider the secondary index table storage
> file format change later.
>
> Please let me know your opinion/suggestion if we can go with Option-1 or
> Option-2 or both Option 1 + 2 or any other suggestion ?
>
>
> Thanks,
> Venu Reddy
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] Taking the inputs for Segment Interface Refactoring

2021-01-05 Thread Ajantha Bhat
Hi all,

As per the online meeting, I have thought through the design of the
transaction manager as well.
The transaction manager can be responsible for:
a. Across-table transactions --> expose start transaction, commit
transaction and rollback transaction to the user/application. Commit the table
status files of all tables at once, only if the current transaction succeeded
on all tables.
b. Table-level versioning/MVCC for time travel: internally get the
transaction id (version id) for each table-level operation (DDL/DML), and
write multiple table status files, one per version, for time travel, and also
keep one transaction file.

However, combining the transaction manager with the segment interface refactoring
work will complicate things to design and handle in one PR. So, I want to
handle it step by step.
*So, to handle segment interface refactoring first, please go through the
document attached in the previous mail (also present in the JIRA) and provide
your opinion (+1) to go ahead.*

Thanks,
Ajantha


On Fri, Nov 13, 2020 at 2:43 PM Ajantha Bhat  wrote:

> Hi Everyone.
> Please find the design of refactored segment interfaces in the document
> attached. Also can check the same V3 version attached in the JIRA [
> https://issues.apache.org/jira/browse/CARBONDATA-2827]
>
> It is based on some recent discussions and the previous discussions of
> 2018
> [
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Refactor-Segment-Management-Interface-td58926.html
> ]
>
> *Note:*
> 1) As the pre-aggregate feature is not present and MV and SI support
> incremental loading, the previous problem of committing all child
> table statuses at once may not be applicable anymore, so the interfaces for that were removed.
> 2) All these will be developed in a new module called *carbondata-acid*,
> and the other required modules will depend on it.
> 3) Once this is implemented, we can discuss the design of time travel on
> top of it [transaction manager implementation and writing multiple table
> status files with versioning].
>
> Please go through it and give your inputs.
>
> Thanks,
> Ajantha
>
> On Mon, Oct 19, 2020 at 9:43 AM David CaiQiang 
> wrote:
>
>> I list the feature list about segments as follows, before starting to refactor
>> the segment interface.
>>
>> [table related]
>> 1. get lock for table
>>lock for tablestatus
>>lock for updatedTablestatus
>> 2. get lastModifiedTime of table
>>
>> [segment related]
>> 1. segment datasource
>>datasource: file format,other datasource
>>fileformat: carbon,parquet,orc,csv..
>>catalog type: segment, external segment
>> 2. data load etl(load/insert/add_external_segment/insert_stage)
>>write segment for batch loading
>>add external segment by using external folder path for mixed file
>> formatted table
>>append streaming segment for spark structured streaming
>>insert_stage for flink writer
>> 3. data query
>>segment properties and schema
>>segment level index cache and pruning
>>cache/refresh block/blocklet index cache if needed by segment
>>read segments to a dataframe/rdd
>> 4. segment management
>>new segment id for loading/insert/add_external_segment/insert_stage
>>create global segment identifier
>>show[history]/delete segment
>> 5. stats
>>collect dataSize and indexSize of the segment
>>lastModifiedTime, start/end time, update start/end time
>>fileFormat
>>status
>> 6. segment level lock for supporting concurrent operations
>> 7. get tablestatus storage factory
>>storage solution 1): use file system by default
>>storage solution 2): use hive metastore or db
>>
>> [table status related]:
>> 1. record new LoadMetadataDetails
>>  loading/insert/compaction start/end
>>  add external segment start/end
>>  insert stage
>>
>> 2. update LoadMetadataDetails
>>   compaction
>>   update/delete
>>   drop partition
>>   delete segment
>>
>> 3. read LoadMetadataDetails
>>   list all/valid/invalid segment
>>
>> 4. backup and history
>>
>> [segment file related]
>> 1. write new segment file
>>   generate segment file name
>>  better to use new timestamp to generate new segment file name for
>> each
>> writing. avoid overwriting segment file with same name.
>>write segment file
>>merge temp segment file
>> 2. read segment file
>>readIndexFiles
>>readIndexMergeFiles
>>getPartitionSpec
>> 3. update segment file
>>update
>>merge index
>>drop partition
>>
>> [clean files related]

[Discussion] Upgrade presto-sql to 333 version

2020-12-18 Thread Ajantha Bhat
Hi all,
Currently carbondata is integrated with presto-sql 316, which is 1.5 years
old.
Many good features and optimizations have come into presto since then, like
dynamic filtering, the Rubix data cache and some performance improvements.

It is always good to use the latest version, which is presto-sql 348,
but jumping from 316 to 348 would be too many changes.
So, to utilize these new features, and based on customer demand, I suggest
upgrading presto-sql to version 333.
Later it will be upgraded again to a more recent version in a few months.

The plain integration with presto 333 is completed:
https://github.com/apache/carbondata/pull/4034
The deep integration to support new features like dynamic filtering and the Rubix
data cache is under analysis and will be handled in another PR.

Thanks,
Ajantha


Re: [DISCUSSION] Geo spatial index algorithm improvement and UDFs enhancement

2020-12-17 Thread Ajantha Bhat
Hi Shen Jiayu,
It is an interesting feature, thanks for proposing this.

+1 from my side for high-level design,

I have a few suggestions and questions.
a) It is better to separate the new UDF and utility UDF PR from the algorithm
improvement PR, for ease of review and maintainability.
b) Union, intersection, and diff of polygons can be computed during the
filter expression creation, which can send the final polygon coordinates as
one range filter to carbon.
c) About the algorithm improvement, I saw that you have removed a few
parameters like ‘minLongitude’, ‘maxLongitude’, ‘minLatitude’,
‘maxLatitude’. Has anything else changed? Can you describe more about what kind
of changes were done to improve the algorithm?
d) Please capture the performance results with and
without the algorithm changes.
e) You have also mentioned supporting a user-provided geohash column during load.
In this case there is no need to configure any spatial index properties in the
table properties, right?

Thanks,
Ajantha

On Mon, Dec 14, 2020 at 9:18 PM haomarch  wrote:

> Hi Community,
>
> Now carbondata supports geo spatial index and one query UDF 'InPolygon'.
> We plan to optimize the Spatial index feature with three points:
>
> 1 reduce the parameters of table properties when creating geo table;
> 2 add more UDFs and support more complex query scenario;
> 3 allow user to define the spatial index when 'LOAD' and 'INSERT INTO', and
> carbon will still generated the value of spatial index column internally
> when user does not give.
>
>
> I have added an initial v1 design document 'CarbonData Spatial Index Design
> Doc.docx' and UDF interface design document 'Carbon Geo UDF Enhancement
> Interface Design.docx', please check and give comments/inputs/suggestions.
>
> CarbonData_Spatial_Index_Design_Doc.docx
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/CarbonData_Spatial_Index_Design_Doc.docx>
>
> Carbon_Geo_UDF_Enhancement_Interface_Design.docx
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/Carbon_Geo_UDF_Enhancement_Interface_Design.docx>
>
>
> Thanks,
>
> Regards,
> Shen Jiayu
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Size control of minor compaction

2020-11-23 Thread Ajantha Bhat
Hi Zhangshunyu, Thanks for providing more details on the problem.

If it is just for skipping history segments during auto minor compaction,
adding a size threshold for minor compaction should be fine.
We can have a table-level, dynamically configurable threshold.
If it is not configured, consider all the segments for merging; if
configured, consider only the segments within the threshold value.

Thanks,
Ajantha

On Mon, Nov 23, 2020 at 5:26 PM Zhangshunyu  wrote:

> Yes, we need to support auto load merge for major compaction or size
> threshold limit for minor compaction.
> In many cases, the user who uses minor compaction only wants to merge small
> segments by time series (the segments are generated in a time series);
> they don't want to merge a big segment which is already large enough.
>
>
>
> -
> My English name is Sunday
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Size control of minor compaction

2020-11-23 Thread Ajantha Bhat
Hi Zhangshunyu,

For scenario-specific cases like this, the user can use custom compaction by
mentioning the segment ids which need to be considered for compaction.

Also, if you just want size-based merging, major compaction can be used.

So, why are you thinking of supporting size-based minor compaction? It will
basically lose the meaning of combining segments based on their number.

If you are using minor compaction for this scenario just because it
supports auto compaction, then maybe we can look at supporting an
"auto_compaction_type" = "minor/major"
option, or the user can write some script to trigger major compaction
automatically.

Thanks,
Ajantha
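
A sketch of the alternatives mentioned above, with example values only (the
SET form assumes the compaction size property is session-settable; otherwise it
belongs in carbon.properties):

  -- Size-controlled major compaction:
  SET carbon.major.compaction.size=2048;   -- value in MB
  ALTER TABLE sales COMPACT 'MAJOR';

  -- Hand-picking only the small segments, leaving large ones untouched:
  ALTER TABLE sales COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (3, 4, 5);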


On Mon, 23 Nov, 2020, 12:11 pm Zhangshunyu,  wrote:

> Hi dev,
> Currently, minor compaction only considers the number of segments and major
> compaction only considers the total size of segments. But consider a scenario
> where the user wants to use minor compaction by the number of segments but
> doesn't want to merge segments whose data size is larger than a threshold, for
> example 2GB, as there is no need to merge such big segments and it is time
> costly.
> So we need to add a parameter to control the size threshold of segments included
> in minor compaction, so that the user can exclude a segment from
> minor compaction once its data size exceeds the threshold; of course a
> default
> value must be there.
>
> So, what's your opinion about this?
>
>
>
> -
> My English name is Sunday
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION]Merge index property and operations improvement.

2020-11-22 Thread Ajantha Bhat
Hi Akash,
In point 3, you have mentioned that there is no need to fail the load if the
merge index fails.
So, how do we create the merge index again (as the first-time query is slow
without the merge index) if you block the command for new tables (as per point 2)?
It is contradicting, I guess.

Here are my inputs for this,
*For transactional tables*, as the merge index step immediately deletes the
index files, concurrent queries can fail. So,

a) We can avoid exposing index files to the query (user) by making the load
status success only after the merge index is created.
Also, update the table status file and segment file only once, after the merge
index is created; there is no need to update them with the index file info before.
Also, keep a maximum retry for the table status file here, as this is the last
operation of the load, and failing here is costly because the whole load has to be retried.

b) After ensuring point a), if the merge index creation fails (which cannot
happen in most cases), we can fail the load.

c) We still need to support the alter table merge index command (mainly
required for the old-table upgrade scenario); no need to block it for new tables.
When the user runs it, if the index files don't exist (which we can know by reading
the segment file), the command can finish immediately and print a warning log that
no index files are present to merge.

d) The merge index carbon property (carbon.merge.index.in.segment) can be
removed directly.


Thanks,
Ajantha
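
For reference, a minimal sketch of the command discussed in point c), used
mainly for tables upgraded from older versions (table name is a placeholder):

  -- Merge the per-block index files of existing segments into merge-index files:
  ALTER TABLE sales COMPACT 'SEGMENT_INDEX';

For new tables, the idea above is that the merge index is created as part of the
load itself (controlled today by carbon.merge.index.in.segment), so this command
would mostly print a warning that there is nothing left to merge.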

On Mon, Nov 9, 2020 at 1:56 PM Akash Nilugal  wrote:

> Hi All,
>
> Currently, we have the merge index feature, which can be enabled or disabled
> and is enabled by default.
> During load or compaction, we first create index files and then create the
> merge index;
> if merge index generation fails we don't fail the load, and we have the alter
> compact command to handle the unmerged
> index files.
>
> here are few things I want to suggest.
>
> 1. Deprecate the merge index property and keep it only for developer
> purposes.
> 2. Do not allow the alter compact merge index command for new tables, as the
> merge index is already created, and allow it only for legacy tables.
>Alter merge index can be allowed only in the below conditions.
>a) when the update has happened on segment.
>b) when merge index creation failed during load or compaction.
> 3. Also, no need to fail the load if the merge index fails (same as existing
> behavior).
>
> Please suggest any modifications or any additions to this.
>
> Thanks
>
> Regards,
> Akash R Nilugal
>


Re: [ANNOUNCE] Ajantha as new PMC for Apache CarbonData

2020-11-20 Thread Ajantha Bhat
Thank you all !!

On Fri, 20 Nov, 2020, 1:45 pm manish gupta, 
wrote:

> Congratulations Ajantha 
>
> On Fri, 20 Nov 2020 at 1:21 PM, BrooksLi  wrote:
>
> > Congratulations to Ajantha!
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Discussion] Taking the inputs for Segment Interface Refactoring

2020-11-13 Thread Ajantha Bhat
Hi Everyone.
Please find the design of refactored segment interfaces in the document
attached. Also can check the same V3 version attached in the JIRA [
https://issues.apache.org/jira/browse/CARBONDATA-2827]

It is based on some recent discussions and the previous discussions of 2018
[
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Refactor-Segment-Management-Interface-td58926.html
]

*Note:*
1) As the pre-aggregate feature is not present and MV and SI support
incremental loading, the previous problem of committing all child
table statuses at once may not be applicable anymore, so the interfaces for that were removed.
2) All these will be developed in a new module called *carbondata-acid*, and
the other required modules will depend on it.
3) Once this is implemented, we can discuss the design of time travel on
top of it [transaction manager implementation and writing multiple table
status files with versioning].

Please go through it and give your inputs.

Thanks,
Ajantha

On Mon, Oct 19, 2020 at 9:43 AM David CaiQiang  wrote:

> I list the feature list about segments as follows, before starting to refactor
> the segment interface.
>
> [table related]
> 1. get lock for table
>lock for tablestatus
>lock for updatedTablestatus
> 2. get lastModifiedTime of table
>
> [segment related]
> 1. segment datasource
>datasource: file format,other datasource
>fileformat: carbon,parquet,orc,csv..
>catalog type: segment, external segment
> 2. data load etl(load/insert/add_external_segment/insert_stage)
>write segment for batch loading
>add external segment by using external folder path for mixed file
> formatted table
>append streaming segment for spark structured streaming
>insert_stage for flink writer
> 3. data query
>segment properties and schema
>segment level index cache and pruning
>cache/refresh block/blocklet index cache if needed by segment
>read segments to a dataframe/rdd
> 4. segment management
>new segment id for loading/insert/add_external_segment/insert_stage
>create global segment identifier
>show[history]/delete segment
> 5. stats
>collect dataSize and indexSize of the segment
>lastModifiedTime, start/end time, update start/end time
>fileFormat
>status
> 6. segment level lock for supporting concurrent operations
> 7. get tablestatus storage factory
>storage solution 1): use file system by default
>storage solution 2): use hive metastore or db
>
> [table status related]:
> 1. record new LoadMetadataDetails
>  loading/insert/compaction start/end
>  add external segment start/end
>  insert stage
>
> 2. update LoadMetadataDetails
>   compaction
>   update/delete
>   drop partition
>   delete segment
>
> 3. read LoadMetadataDetails
>   list all/valid/invalid segment
>
> 4. backup and history
>
> [segment file related]
> 1. write new segment file
>   generate segment file name
>  better to use new timestamp to generate new segment file name for each
> writing. avoid overwriting segment file with same name.
>write segment file
>merge temp segment file
> 2. read segment file
>readIndexFiles
>readIndexMergeFiles
>getPartitionSpec
> 3. update segment file
>update
>merge index
>drop partition
>
> [clean files related]
> 1. clean stale files for the successful  segment operation
>data deletion should delay a period of time(maybe query timeout
> interval), avoid deleting file immediately(beside of drop table/partition,
> force clean files)
>include data file, index file, segment file, tablestatus file
>impact operation: mergeIndex
> 2. clean stale files for failed segment operation immediately
>
>
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] About carbon.si.segment.merge feature

2020-11-10 Thread Ajantha Bhat
@David:
a) Yes, SI can use global sort by default.
b) Handling the original SI load itself to launch tasks based on the SI segment size
(we need to figure out how to estimate it) is better;
else we have to go with the one-task-per-node logic (similar to main table
local sort). But the current logic needs to be changed to avoid the small-files
problem.
c) Refresh Index for SI is currently only for merging the small files; I think we
have to rename this command, as the naming doesn't make sense.
And ReIndex is for loading the missed SI segments from the main table; we cannot
use it for the merge.

@Akash:
a) The loading time difference between SI global_sort and local_sort is the
same as the data loading difference between global sort and local sort for any
table; we already have it.
b) Yes, after implementing the new SI load logic (task launch based on segment
size), we can compare the current time with the refresh index time. If there is
not much difference, we can remove refresh index support for SI.

Thanks,
Ajantha

On Mon, Nov 9, 2020 at 1:04 PM akashrn5  wrote:

> Hi,
>
> Its better to remove i feel, as lot of code will be avoided and we can do
> it
> right the first time we do it.
>
> but please consider below points.
>
> 1. may be once we can test the time difference of global sort and exiting
> local sort load time, may be per segment basis, so that we can have a
> overall time difference we can get in load, basically if we can note down
> the tradeoff time, that's better for future reference and in user
> perspective also.
>
> 2. Also can you check the refresh index and reload time diff, because we
> need to see if all users fine with dropping and recreating again.
>
> Regards,
> Akash
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION]Join optimization with Carbondata's metadata

2020-11-10 Thread Ajantha Bhat
Hi Akash,
*Just my opinion*: once Spark supports it, we can handle it in carbon
if something needs to be supported.
*Doing this change independently of Spark can make us lose the advantage once
Spark brings it in by default.*

Qubole's dynamic filtering is already merged in prestosql, and this will be
merged in Spark also, as it is beneficial.
So, maybe we can first support Spark 3.x with carbon (which will first
bring the DPP [dynamic partition pruning] optimization)
and handle dynamic filtering when Spark supports it.


Thanks,
Ajantha

On Tue, Nov 10, 2020 at 3:28 PM akashrn5  wrote:

> please note below points addition to above
>
> 1. There is a JIRA in Spark similar to what I have raised:
>
> https://issues.apache.org/jira/browse/SPARK-27227
> It is also aimed at the same goal, but it is still in progress and targeted for spark
> 3.1.0.
> There they plan to first execute a query on the right table to get the min/max,
> bloom index and the like, and
> apply it to the left table; the design is still in review, you can go through it once.
> We can look deeper into it.
>
> 2.
>
> https://www.qubole.com/blog/enhance-spark-performance-with-dynamic-filtering/
> This is also similar one but its in private version,
> So please consider this also.
>
> With the above info and our segment-info metadata, or maybe what we store in
> cache
> once we scan the small table, we can use that info to reduce the scan of the big
> table,
> as we still do not have spark 3 integration and dynamic filtering is still
> in the design phase.
>
> Please give your inputs, we can discuss further.
>
> Thanks
>
> Akash R
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] About carbon.si.segment.merge feature

2020-11-06 Thread Ajantha Bhat
A small update on the merge flow:
currently, in local_sort SI merge, the task launch is based on size; as many
carbon files as are formed after the merge, that many tasks will be launched
for the merge [CarbonSIRebuildRDD.internalGetPartitions].
Global_sort merge should also implement identifying global_sort_partitions based
on how many carbon files are formed after the merge (similar to the local sort
flow).

But we need to conclude whether the merge flow is really required, or whether we
can just keep the SI loading itself as 1-node-1-task logic [similar to our main
table local sort] and avoid the need for the merge operation.

Thanks,
Ajantha
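
For readers of this thread, a minimal sketch of the command whose future is
being questioned (index and table names are placeholders; the syntax follows
the form quoted in point c) below):

  -- Merge the small data files inside the existing segments of a secondary index:
  REFRESH INDEX idx_name ON TABLE maintable;

The alternative discussed here is to launch the original SI load with
one task per node (or based on the SI segment size), so the small files are
never produced and this extra merge pass is not needed.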

On Fri, Nov 6, 2020 at 4:41 PM Ajantha Bhat  wrote:

> Hi,
>
> when a carbon property *carbon.si.segment.merge = true*,
>
> *a)  local_sort SI segment loading (default) [All the SI columns are
> involved]*
>
> SI load will load with default local_sort. There will be two times data
> loading, the first time is by querying the main table and creating the SI
> segment (here the number of tasks launched is equal to carbon files present
> in the main table segment), during these operations currently SI creates
> many small files.
> Then the merge operation will query the newly created SI segment and load
> data by local_sort again (here few tasks are launched, one node one task),
> so fewer files created.
>
> *>> So, we can optimize the first time SI segment creation itself to use
> one node one task logic and avoid creating small files and remove calling
> merge operation. with this, we can remove carbon.si.segment.merge property
> itself.*
> *b) global_sort SI segment loading [All the SI columns are involved]*
>
> SI load will load with a global sort. There will be two times data
> loading, first time is by querying the main table and creating SI segment
> (here the number of tasks launched (global_sort_partitions) is equal to
> carbon files present in the main table segment), during this operations
> currently SI creates many small files.
> Then the merge operation will query the newly created SI segment and load
> data by local sort again [there is no global sort logic presently] (here
> few tasks are launched, one node one task), but this will disorder the
> globally sorted data!
>
> *>> So, the user can configure global sort partition, but if the user
> didn't configure, code can use global_sort_partitions = number of active
> nodes and load the data to avoid creating the small files and remove
> calling merge operation. with this, we can remove carbon.si.segment.merge
> property itself.*
> *c) REFRESH INDEX <index_name> ON TABLE <table_name>*
> If the user created the SI table in the previous version and has small
> files, can use this command to merge the small files. But if the user drops
> the index and creates it again, then no need for this command also [because
> merge and creating new SI takes a similar time]. So, do we need to support
> this command for the global sort?
> If we decide to retain the rebuild command then for global_sort, we need
> to add a new implementation as this command has only local sort code.
>
> Let me know your opinion on this.
>
> Thanks,
> Ajantha
>


Re: [DISCUSSION] Support MERGE INTO SQL API

2020-11-04 Thread Ajantha Bhat
+1,

Thanks for planning to implement this.

Please define the limitations or scope in more detail for WHEN MATCHED and
WHEN NOT MATCHED.
For example, when NOT MATCHED, is UPDATE also supported? (I guess only
INSERT is supported.)

Thanks,
Ajantha

On Thu, Nov 5, 2020 at 8:10 AM BrooksLi  wrote:

> [Background]
> Currently, Carbondata does not have a SQL command to support upsert.
>
> [Motivation]
> Since we already have the merge into dataset API, we can develop a MERGE INTO
> SQL API. Since the merge into command is a little bit complex, we may need
> to develop a SQL parser with ANTLR to parse the SQL.
>
> MERGE INTO SQL COMMAND
>
> MERGE INTO [db_name.]target_table [AS target_alias]
> USING [db_name.]source_table [AS source_alias]
> ON <merge_condition>
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
>
> MERGE INTO TAREGT
> USING SOURCE
> ON SOURCE.ID=TARGET.ID
> WHEN MATCHED THEN
> UPDATE SET TARGET.NAME = SOURCE.NAME
> WHEN NOT MATCHED THEN
> INSERT (TARGET.ID, TARGET.NAME, TARGET.AGE) VALUES ( SOURCE.ID,
> SOURCE.NAME,
> SOURCE.AGE)
>
> TARGET TABLE
> ID  Name Age
> 1   Jan   23
> 2   KK22
> 3   Joe   27
>
> SOURCE TABLE
> ID  NameAge
> 2   Steve   22
> 4   Mike24
>
>
> AFTER MERGE INTO COMMAND
> TARGET TABLE
> ID  NameAge
> 1   Jan 23
> 2   Steve   22
> 3   Joe 27
> 4   Mike24
>
> In the first version of the implementation, the MERGE INTO SQL command in Carbondata
> will
> support basic table merge conditions, and it will support up to 2 MATCHED
> clauses and 1 NOT MATCHED clause.
>
> In the second version, it can support Carbondata features such as
> segments,
> time travel versions, etc.
>
>
> CarbonDataMergeIntoSQl.docx
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t492/CarbonDataMergeIntoSQl.docx>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 2.1.0(RC2) release

2020-11-04 Thread Ajantha Bhat
+1,

Thanks,
Ajantha

On Wed, 4 Nov, 2020, 2:17 pm akashrn5,  wrote:

> +1 for release.
>
> Thanks.
>
> Regards,
> Akash R Nilugal
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] Partition Optimization

2020-10-29 Thread Ajantha Bhat
+1,

Not keeping the partition values as a column (as the folder name already
has them) is a great way to reduce the store size.
We might have to handle compatibility and also support refresh table.

Apache Iceberg has a more mature concept called *hidden partitioning*, where
they also maintain the relationship between columns and support dynamic
rollup of partitions based on the query. You can analyze this (
https://iceberg.apache.org/partitioning/).

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju <
maheshraju.o...@gmail.com> wrote:

> Dear Community,
>
> This mail is regarding partition optimization.
>
> *Current behaviour:* Currently partition column information is storing in
> data files after load/insert. When we query for partition data we are
> fetching from data files and filling the row.
>
> *Proposed optimization:* In this enhancement the idea is to remove/exclude
> partition column information while loading/insert[writing]. it means data
> files does not contain any partition column information. When we query for
> partition data[readers] fill the partition information with help from
> projection partiton columns[pass to BlockExecutionInfo and get it] and
> blockId[which has partition column name and value] and fill the row and
> return.
>
> *Benefits*:
> 1) query performance should be faster
> 2) store size should be less compare to old behavior.
>
> Please have a look *WIP PR[#1]* is raised for the same and we are working
> on CI failures currently.
>
> #1 https://github.com/apache/carbondata/pull/3695/
>
> Please provide your valuable inputs and suggestions. Thank you in advance !
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
>


Re: Will carbon support MERGE INTO sql?

2020-10-25 Thread Ajantha Bhat
Hi Zhangshunyu,
Yes, we are aware of this. As API support is already there, this was a
lower priority compared to other pending work like IUD performance
improvement and new feature implementations like time travel and segment
interface refactoring.

If you are interested in contributing SQL syntax support for merge, you
are most welcome.
Otherwise, somebody may take this up in the next version.

Thanks,
Ajantha


On Mon, 26 Oct, 2020, 9:19 am Zhangshunyu,  wrote:

> Hi dev,
> I see carbon now supports the merge into API, but not MERGE INTO SQL.
> Take Delta Lake as an example:
>
> MERGE INTO [db_name.]target_table [AS target_alias]
> USING [db_name.]source_table [] [AS source_alias]
> ON <merge_condition>
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
> [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
>
> Will carbon support this in the future? Is there any plan for this?
>
>
>
> -
> My English name is Sunday
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[Discussion] Taking the inputs for Segment Interface Refactoring

2020-10-14 Thread Ajantha Bhat
Hi Dev,
We have discussed segment interface refactoring multiple times, but
we are not moving ahead.
The final goal of this activity is to *design a clean segment interface
that can support time travel, concurrent operations and transaction
management.*

So, I am welcoming problems, ideas and designs for the same, as many people
have different ideas about it.
We can have a virtual design meeting for this if required.

Thanks,
Ajantha


Re: Parallel Insert and Update

2020-10-14 Thread Ajantha Bhat
Hi Kejian Li,
Thanks for working on this.

I see that this design and requirement is similar to what Nihal has
discussed a few days ago.
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-Parallel-compaction-and-update-td100338.html

So, probably, as Ravindra suggested for Nihal, it may be better to handle this via
segment interface refactoring as well.

Thanks,
Ajantha

On Wed, Oct 14, 2020 at 2:46 PM Kejian Li <820972...@qq.com> wrote:

> Dear community,
>
> This mail is regarding parallel insert and update. Currently we are not
> supporting concurrent insert (or load data) and update because it may cause
> data inconsistency or incorrect result.
>
> Now Carbon blocks update operation when insert operation is in progress by
> throwing out concurrent operation exception directly. If there is an
> executing insert operation that is very time consuming, then update
> operation has to wait and sometimes this waiting time is very long.
>
> To come out with this problem, we are planning to support parallel insert
> and update. And here I have proposed one of the solutions to implement this
> feature.
>
> This is the design document:  Parallel_Insert_and_Update.pdf
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t495/Parallel_Insert_and_Update.pdf>
>
>
> Please go through this solution document and provide your input if this
> approach is okay or any drawback is there.
>
> Thanks & Regards
> Kejian Li
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANN] Indhumathi as new Apache CarbonData committer

2020-10-06 Thread Ajantha Bhat
Congratulations indhumathi.

On Wed, 7 Oct, 2020, 8:16 am Liang Chen,  wrote:

> Hi
>
>
> We are pleased to announce that the PMC has invited Indhumathi as new
>
> Apache CarbonData committer, and the invite has been accepted!
>
>
> Congrats to Indhumathi and welcome aboard.
>
>
> Regards
>
> The Apache CarbonData PMC
>


Re: [VOTE] Apache CarbonData 2.1.0(RC1) release

2020-10-04 Thread Ajantha Bhat
Hi,
Thanks for preparing the release.

*-1 from my side for this release package.*

The reasons are:
a. Many PRs are yet to be merged [for example, the Presto write PR #3875, #3916,
code cleanup #3950, and other PRs like #3934].
b. Please go through the key features again; SI global sort support and Presto
complex type read support can also be added.
c. Many user-reported bugs are not fixed and need to be fixed; they were reported
against the previous version, e.g. CARBONDATA-3905, CARBONDATA-3904, CARBONDATA-3954.
d. CI has random failures, which need to be fixed before the release.

Thanks,
Ajantha


On Sun, Oct 4, 2020 at 9:18 PM Kunal Kapoor 
wrote:

> Hi All,
>
> I submit the Apache CarbonData 2.1.0(RC1) for your vote.
>
>
> *1.Release Notes:*
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12347868==12320220=Create_token=A5KQ-2QAV-T4JA-FDED_e759c117bdddcf70c718e535d9f3cea7e882dda3_lout
>
> *Some key features and improvements in this release:*
>
>- Support Float and Decimal in the Merge Flow
>- Implement delete and update feature in carbondata SDK.
>- Support array with SI
>- Support IndexServer with Presto Engine
>- Insert from stage command support partition table.
>- Implementing a new Reindex command to repair the missing SI Segments
>- Support Change Column Comment
>
>  *2. The tag to be voted upon* : apache-carbondata-2.1.0-rc1
> 
>
> Commit: acef2998bcdd10204cdabf0dcdb123bbd264f48d
> <
> https://github.com/apache/carbondata/commit/acef2998bcdd10204cdabf0dcdb123bbd264f48d
> >
>
> *3. The artifacts to be voted on are located here:*
> https://dist.apache.org/repos/dist/dev/carbondata/2.1.0-rc1/
>
> *4. A staged Maven repository is available for review at:*
> https://repository.apache.org/content/repositories/orgapachecarbondata-1064
>
>
> Please vote on releasing this package as Apache CarbonData 2.1.0,
> The vote will be open for the next 72 hours and passes if a majority of at
> least three +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache CarbonData 2.1.0
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
> [ ] -1 Do not release this package because...
>
>
> Regards,
> Kunal Kapoor
>


Re: Regarding Carbondata Benchmarking & Feature presentation

2020-09-17 Thread Ajantha Bhat
Hi Vimal,

*We have archived the latest presentation in wiki now.*
https://cwiki.apache.org/confluence/display/CARBONDATA/Carbondata+2.0+Release+Meetup
Please check and let us know if you have any questions.

Regarding the performance report: the latest report will take some more
time. You can find older reports archived in the wiki.
Once the latest report is ready, we will share it with you.
The summary of the performance tests is:
with a basic table (no sort, no secondary index, no materialized view),
TPC-DS queries are on par with or better than other formats.
With global sort, SI and MV, queries perform much better.

Thanks,
Ajantha

On Thu, Sep 17, 2020 at 11:57 AM Ajantha Bhat  wrote:

> Hi, Thanks for planning to propose carbon.
>
> Please join our slack to directly discuss with members also.
>
> https://join.slack.com/t/carbondataworkspace/shared_invite/zt-g8sv1g92-pr3GTvjrW5H9DVvNl6H2dg
>
> we will get back to you on the presentations and benchmarks.
>
> Thanks,
> Ajantha
>
> On Thu, Sep 17, 2020 at 11:42 AM Vimal Das Kammath <
> vimaldas.kamm...@gmail.com> wrote:
>
>> Hi Carbondata Team,
>>
>> I am working on proposing Carbondata to the Data Analytics team in Uber.
>> It
>> will be great if any of you can share the latest benchmarking and
>> feature/design presentation.
>>
>> Regards,
>> Vimal
>>
>


Re: Regarding Carbondata Benchmarking & Feature presentation

2020-09-17 Thread Ajantha Bhat
Hi, Thanks for planning to propose carbon.

Please join our slack to directly discuss with members also.

https://join.slack.com/t/carbondataworkspace/shared_invite/zt-g8sv1g92-pr3GTvjrW5H9DVvNl6H2dg

we will get back to you on the presentations and benchmarks.

Thanks,
Ajantha

On Thu, Sep 17, 2020 at 11:42 AM Vimal Das Kammath <
vimaldas.kamm...@gmail.com> wrote:

> Hi Carbondata Team,
>
> I am working on proposing Carbondata to the Data Analytics team in Uber. It
> will be great if any of you can share the latest benchmarking and
> feature/design presentation.
>
> Regards,
> Vimal
>


Re: Clean files enhancement

2020-09-15 Thread Ajantha Bhat
Hi vikram, Thanks for proposing this.

a) If the file system is HDFS, *HDFS already supports trash*: when data is
deleted in HDFS, it is moved to trash instead of being permanently deleted
(the trash interval can also be configured via *fs.trash.interval*; a sample
configuration is sketched after this list).
b) If the file system is object storage like S3A or OBS, *they support
bucket versioning*. The user should configure it to go back to a previous
snapshot.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/undelete-objects.html
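As an illustration of point (a), here is a minimal sketch of relying on the
HDFS trash instead of a CarbonData-level trash folder. The 1440-minute interval
and the /tmp/segment_0 path are made-up example values; in practice
fs.trash.interval is set cluster-wide in core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Equivalent to setting fs.trash.interval in core-site.xml:
    // deleted files stay in the user's .Trash for 1440 minutes (one day).
    conf.set("fs.trash.interval", "1440");

    FileSystem fs = FileSystem.get(conf);
    // Instead of fs.delete(path, true), move the file to trash so it can be restored.
    Trash.moveToAppropriateTrash(fs, new Path("/tmp/segment_0"), conf);
  }
}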

*So, basically this functionality has to be in the underlying file system,
not in the CarbonData layer.*
Keeping a trash folder with many configurations and checking the aging of the
trash folder can work, but it makes the system complex and adds the extra
overhead of maintaining this functionality.

Based on this,
*-1 from my side for this feature*. You can wait for other people's
opinions before concluding.

Thanks,
Ajantha



On Thu, Sep 10, 2020 at 4:20 PM vikramahuja1001 
wrote:

> Hi all,
> This mail is regarding enhancing the clean files command.
> Current behaviour : Currently when clean files is called, the segments
> which
> are MARKED_FOR_DELETE or are COMPACTED are deleted and their entries are
> removed from tablestatus file, Fact folder and metadata/segments folder.
>
> Enhancement behaviour idea: In this enhancement the idea is to create a
> trash folder (like a Recycle Bin, with 777 permissions) which can be stored in
> the /tmp folder (or a user-defined folder; a new property will be exposed).
> Whenever a segment is cleaned, the necessary carbondata files (no other files)
> can be copied to this folder. The trash folder can have a folder for each
> table with a name like DBName_TableName. We can keep the carbondata files
> here for 3 days (or as long as the user wants; a carbon property will be
> exposed for the same). They can be deleted if they have not been modified for
> 3 days, or as per the property. We can maintain a thread which checks the
> aging time and deletes the necessary carbondata files from the trash folder.
>
> Apart from that, while cleaning INSERT_IN_PROGRESS segments will be cleaned
> too, but will try to get a segment lock before cleaning the
> INSERT_IN_PROGRESS segments. If the code is able to acquire the segment
> lock, i.e., it is a stale folder, it can be cleaned. If the code is not
> able
> to acquire the segment lock that means load is in progress or any other
> operation is in progress, in that case the INSERT_IN_PROGRESS segment will
> not be cleaned.
>
> Please provide input and suggestions for this enhancement idea.
>
> Thanks
> Vikram Ahuja
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] Improve the reading/writing performance on the big tablestatus file

2020-09-04 Thread Ajantha Bhat
Hi David,

a) Compressing the table status file is good, but we need to check the
decompression overhead and how much overall benefit we can get.
b) I suggest we keep multiple 10 MB files (or make the size configurable) and
then read them in a distributed way.
c) Once all the table status files are read, it is better to cache them at the
driver with a multilevel hash map [first level being the status of the
segment, second level the segment id], as sketched below.
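To illustrate point (c), a minimal sketch of such a two-level cache; the class
and method names are hypothetical, not existing CarbonData code.

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TableStatusCache {
  // First level: segment status (e.g. "SUCCESS", "MARKED_FOR_DELETE").
  // Second level: segment id -> cached detail entry (kept as a String here).
  private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();

  public void put(String status, String segmentId, String detail) {
    cache.computeIfAbsent(status, s -> new ConcurrentHashMap<>()).put(segmentId, detail);
  }

  // For example, fetch only the SUCCESS segments without touching the others.
  public Map<String, String> getByStatus(String status) {
    return cache.getOrDefault(status, Collections.emptyMap());
  }

  public String get(String status, String segmentId) {
    return getByStatus(status).get(segmentId);
  }
}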

Thanks,
Ajantha

On Fri, Sep 4, 2020 at 10:19 AM akashrn5  wrote:

> Hi David,
>
> After discussing with you its little bit clear, let me just summarize in
> some lines
>
> *Goals*
> 1. reduce the size of the status file (which reduces the overall size by some MBs)
> 2. make the table status file less prone to failures, and faster to read
>
> *For the above goals with your solutions*
>
> 1. use the compressor to compress the table status file, so that an in-memory
> read happens during read and it will be faster
> 2. to make it less prone to failure, *+1 for solution3*, which can be combined
> with a bit of solution2 (for the new table status format and trace
> folder structure) and solution3's delta file, to separate the read and write
> paths so that reads are faster and reliability failures are avoided.
>
> Suggestion: one more point is to maintain a cache of the details after the first
> read instead of reading every time; only when the status-uuid is updated do we
> read again, and until then we read from the cache. This will speed up reads and
> help our queries.
>
> I suggest you create a *JIRA and prepare a design document*; there we can
> cover many impact areas and *avoid fixing small bugs after implementation.*
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Discussion] Segment management enhance

2020-09-04 Thread Ajantha Bhat
Hi David,

a) Recently we tested huge concurrent loads and compactions but never faced the
issue of two loads using the same segment id (because of the table status lock in
recordNewLoadMetadata), so I am not sure whether we really need to change
to UUID.

b) As for the other segment interfaces, we have to refactor them; it is long
pending. Refactor such that we can support TIME TRAVEL. I have to analyze this
more. If somebody has already done some analysis, they can use the segment
interface refactoring discussion thread.

Thanks,
Ajantha

On Fri, Sep 4, 2020 at 1:11 PM Kunal Kapoor 
wrote:

> Hi David,
> Then better we keep a mapping for the segment UUID to virtual segment
> number in the table status file as well,
> Any API through which the user can get the segment details should return
> the virtual segment id instead of the UUID.
>
> On Fri, Sep 4, 2020 at 12:59 PM David CaiQiang 
> wrote:
>
> > Hi Kunal,
> >
> >1. The user uses SQL API or other interfaces. This UUID is a
> transaction
> > id, and we already stored the timestamp and other informations in the
> > segment metadata.
> >This transaction id can be used in the loading/compaction/update
> > operation. We can append this id into the log if needed.
> >Git commit id also uses UUID, so we can consider to use it. What
> > information do you want to get from the folder name?
> >
> >2. It is easy to fix the show segment command's issue. Maybe we can
> sort
> > segment by timestamp and UUID to generate the index id.  The user can
> > continue to use it in other commands.
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Discussion] Update feature enhancement

2020-09-04 Thread Ajantha Bhat
Hi David. Thanks for proposing this.

*+1 from my side.*

I have seen users with 200K-segment tables stored in the cloud.
It will be really slow to reload all the segments where an update happened for
indexes like SI, min-max and MV.

So, it is good to write the update as a new segment
and load only the new segment's indexes (try to reuse the
UpdateTableModel.loadAsNewSegment = true flow).

The user can compact the segments to avoid the many new segments created by
updates, and I guess we can also move the compacted segments to the table status
history to avoid more entries in the table status.

Thanks,
Ajantha



On Fri, Sep 4, 2020 at 1:48 PM David CaiQiang  wrote:

> Hi Akash,
>
> 3. The update operation contains an insert operation. The update operation will
> handle this issue the same way the insert operation does.
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Slack workspace launch !

2020-08-04 Thread Ajantha Bhat
Hi all,
For a better discussion thread model and quicker responses, we have created
a free Slack workspace for CarbonData.

Feel free to join the workspace using the invite link below and have
active discussions.

https://join.slack.com/t/carbondataworkspace/shared_invite/zt-g8sv1g92-pr3GTvjrW5H9DVvNl6H2dg

Currently, we maintain two channels in Carbondata workspace, *general* for
discussions and *troubleshooting* for issue troubleshooting.

[image: Screenshot from 2020-08-04 12-21-05.png]

Thanks,
Ajantha


Re: [Discussion] SI support Complex Array Type

2020-07-30 Thread Ajantha Bhat
Hi David & Indhumathi,
Storing an array of strings as just a string column in SI by flattening [with
row-level position references] can result in slow performance in these cases:
* Multiple array_contains() or multiple array[0] = 'x' filters.
* The join solution mentioned can result in multiple scans (one for every
complex filter condition), which can slow down SI performance.
* Row-level SI can slow down SI performance when the filter yields a huge
number of results.
* To support multiple SIs on a single table, complex SI would use row-level
position references while primitive SI uses blocklet-level position
references, which needs extra logic/time for the join.
* Solution 2 cannot support struct column SI in the future, so it cannot
be a generic solution.

Considering the above points, *solution2 is a very good solution if only
one filter exists* on the complex column, *but not a good solution for all
scenarios.*

*So, I have to go with solution1, or wait for other people's opinions
or new solutions.*

Thanks,
Ajantha

On Thu, Jul 30, 2020 at 1:19 PM David CaiQiang  wrote:

> +1 for solution2
>
> Can we support more than one array_contains by using SI join (like SI on
> primitive data type)?
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Presto+Carbon transactional and Non-transactional Write Support

2020-07-27 Thread Ajantha Bhat
+1,

I have some suggestions and questions,

a) You mentioned that currently creating a table from presto and inserting data
results in a non-transactional table.
So, to create a transactional table, do we still depend on spark?
*I feel we should also support transactional table creation with all
table properties from presto (to remove the dependency on spark).*

b) If spark has created a global sort table and presto inserts data, what
will happen? Do we ignore those table properties in the presto write?

c) Partitions and complex types can be handled as an immediate follow-up of this,
as they are very common features nowadays.

Thanks,
Ajantha

On Mon, 27 Jul, 2020, 1:07 pm Kunal Kapoor, 
wrote:

> +1,
> It would be great to have write support from presto
>
> Thanks
> Kunal Kapoor
>
> On Tue, Jul 14, 2020 at 6:08 PM Akash Nilugal 
> wrote:
>
> > Hi Community,
> >
> > As we know, CarbonData is an indexed columnar data format for fast
> > analytics on big data platforms. So
> > we have already integrated with the query engines like spark and even
> > presto. Currently with presto we
> > only support the querying of carbon data files. But we don’t yet support
> > the writing of carbon data files
> > through presto engine.
> >
> > Currently presto is integrated with carbondata for reading the carbondata
> > files via presto.
> > For this, we should be having the store already ready which may be
> written
> > carbon in spark and the table
> > should be hive metastore. So using carbondata connector we are able to
> read
> > the carbondata files. But we
> > cannot create a table or load data into the table in presto. So it becomes a
> > somewhat hectic job to read the
> > carbon files, since they first have to be written with other engines.
> >
> > So here I will be trying to support the transactional load support in
> > presto integration for carbon.
> >
> > I have attached the design document in the Jira, please refer and any
> > suggestions or input is most welcome.
> >
> > https://issues.apache.org/jira/browse/CARBONDATA-3831
> >
> >
> > Regards,
> > Akash R.
> >
>


Re: [Disscussion] Change Default TimeStampFormat to yyyy-mm-dd hh:mm:ss.SSS

2020-07-15 Thread Ajantha Bhat
Hi,
I need to check the points below before concluding on this. If you already have
information on them, please provide it.

1. About the hive and spark default formats: some places mention up to 9 digits
of fractional-second precision, while you mentioned 3. Which file in hive and
spark holds this default value?
2. Why are the current test cases not failing when we compare query results
between hive and carbon for timestamp columns?
3. Also, after we change it, how many test cases need to be modified? [Because
the validation values may not match.]

*As this value is configurable,* I am *neutral* about the proposed change
if the effort is high.

Thanks,
Ajantha

On Tue, Jul 14, 2020 at 2:05 PM haomarch  wrote:

> Spark's default TimeStampFormat is yyyy-mm-dd hh:mm:ss.SSS
> CarbonData shall keep consistent with Spark.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Disscuss] The precise of timestamp is limited to millisecond in carbondata, which is incompatiable with DB

2020-07-15 Thread Ajantha Bhat
+1,
as SimpleDateFormat doesn't support nanoseconds or microseconds.
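For reference, a small sketch of the limitation: SimpleDateFormat's 'S' pattern
letter stops at milliseconds, so micro/nanosecond digits cannot be represented
(the timestamp value below is just an example).

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampPrecision {
  public static void main(String[] args) throws ParseException {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
    // Millisecond precision parses fine.
    Date millis = format.parse("2020-07-14 10:15:30.123");
    System.out.println(millis.getTime()); // prints the epoch time in milliseconds

    // There is no pattern letter for micro/nanoseconds; extra digits are not
    // interpreted as a finer unit, so sub-millisecond precision cannot be kept.
  }
}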

Thanks,
Ajantha

On Tue, Jul 14, 2020 at 5:03 PM xubo245 <601450...@qq.com> wrote:

> +1, please consider compatable  for history data
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-07-15 Thread Ajantha Bhat
Hi Justin,
Thanks for pointing out the "Copyright (c) 2017-2018 Uber
Technologies, Inc." notice.

I see that *two test files* in the *pycarbon* module of carbondata have it.
pycarbon depends on Uber's open source, Apache-licensed *petastorm*
project, and these two test case files imported from that project carry the
notice.

It is an error; we will remove it.

Thanks,
Ajantha

On Wed, Jul 15, 2020 at 7:01 AM Justin Mclean  wrote:

> Hi,
>
> I was taking a look your your release and noticed a couple of files with
> "Copyright (c) 2017-2018 Uber Technologies, Inc." in them, but this is not
> mentioned in your LICENSE file. Is there a reason for this?
>
> You might also consider fixing your NOTICE file to include the year of
> release. Copyright, while it lasts for some time, is not forever.
>
> Thanks,
> Justin
>


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread Ajantha Bhat
Hi,
I didn't reply regarding deprecation. *+1 for deprecating it*.

*And +1 for the issue fix also.*
By issue fix, I didn't mean when *carbon.merge.index.in.segment = false*,
but when *carbon.merge.index.in.segment = true and merge index creation
failed for some reason.*
The code needs to take care of:
a. Moving index files from the temp folder to the final folder in case of a
partition table load.
b. Not failing the current partition load (same as the normal load behavior).
I think these two are not handled after the partition optimization; you can
check and handle them.


Thanks,
Ajantha

On Thu, 9 Jul, 2020, 9:29 pm Akash r,  wrote:

> Hi,
>
> +1, we can deprecate it and as Vishal suggested we can keep as internal
> property for developer purpose.
>
> Regards,
> Akash R Nilugal
>
> On Thu, Jul 9, 2020, 2:46 PM VenuReddy  wrote:
>
> > Dear Community.!
> >
> > We have recently encountered a problem where the segment directory and the
> > segment file in the metadata directory are not created for a partitioned table
> > when the 'carbon.merge.index.in.segment' property is set to 'false'. The actual
> > index files that were present in the respective partition's '.tmp' directory are
> > also deleted without being moved to the respective partition directory where
> > its '.carbondata' files exist. Thus queries throw exceptions while reading
> > index files. Please refer to the jira issue -
> > https://issues.apache.org/jira/browse/CARBONDATA-3834
> > 
> >
> > To address this issue, we have 2 options to go with -
> > 1. Either fix it to work for 'carbon.merge.index.in.segment' set to
> false
> > case. There is an open PR
> https://github.com/apache/carbondata/pull/3776
> >    for it.
> >
> > 2. Or Deprecate the 'carbon.merge.index.in.segment' property itself. As
> > the
> > query performance is better when merge index files are in use when
> compared
> > to normal index files, and default behavior is to generate merge index
> > files, probably it is not necessary to support
> > 'carbon.merge.index.in.segment' anymore.
> >
> > What do you think about it ? Please let me know your opinion.
> >
> > Thanks,
> > Venu
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>
>


Re: [Discussion]Do we still need to support carbon.merge.index.in.segment property ?

2020-07-09 Thread Ajantha Bhat
Hi, what if there are too many index files in a segment and the user wants to
finish the load fast without waiting for the merge index?

In that case, setting merge index = false can help save load time, and the user
can create the merge index during off-peak time.

So I still feel we need to fix the issue that exists when merge index = false.

Thanks,
Ajantha

On Thu, 9 Jul, 2020, 3:05 pm David CaiQiang,  wrote:

> Better to always merge index.
>
> -1 for 1,
>
> +1 for 2,
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 2.0.1(RC1) release

2020-06-01 Thread Ajantha Bhat
+ 1

Regards,
Ajantha

On Mon, 1 Jun, 2020, 4:33 pm Kunal Kapoor,  wrote:

> Hi All,
>
> I submit the Apache CarbonData 2.0.1(RC1) for your vote.
>
>
> *1.Release Notes:*
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12347870
>
>  *2. The tag to be voted upon* :
> apache-carbondata-2.0.1-rc1
> 
>
> *3. The artifacts to be voted on are located here:*
> https://dist.apache.org/repos/dist/dev/carbondata/2.0.1-rc1/
>
> *4. A staged Maven repository is available for review at:*
>
> https://repository.apache.org/content/repositories/orgapachecarbondata-1063/
>
> *5. Release artifacts are signed with the following key:*
> https://people.apache.org/keys/committer/kunalkapoor.asc
>
> Please vote on releasing this package as Apache CarbonData 2.0.1,
> The vote will be open for the next 4 hours because this is a patch release
> and passes if a majority of at least three +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache CarbonData 2.0.1
>
> [ ] 0 I don't feel strongly about it, but I'm okay with the release
>
> [ ] -1 Do not release this package because...
>
>
> Regards,
> Kunal Kapoor
>


Re: [DISCUSSION] About global sort in 2.0.0

2020-05-31 Thread Ajantha Bhat
+1

We can have a minor version patch release.

Also, in the next version, I suggest we analyze the existing test cases and
make them better organized and stronger!

Thanks,
Ajantha

On Mon, 1 Jun, 2020, 9:12 am Kunal Kapoor,  wrote:

> +1
> We can have 2.0.1 as the patch release.
>
> Regards
> Kunal Kapoor
>
> On Mon, Jun 1, 2020 at 9:09 AM Jacky Li  wrote:
>
> > Hi All,
> >
> > In CarbonData version 2.0.0, there is a bug that makes global-sort use an
> > incorrect sort value when the sorting column is String.
> > This impacts all existing global-sort tables when doing a new load or
> > insert into.
> >
> > So I suggest community should have a patch release to fix this bug ASAP.
> > For 2.0.0 version, global-sort on String column is not recommended to
> use.
> >
> > Regards,
> > Jacky
>


Re: [Discussion] Presto read support for complex data types

2020-05-25 Thread Ajantha Bhat
+1.
This is really required, as complex schemas are very common nowadays and
most users have them.

I see that the current design covers only a one-level array.
Multi-level arrays with complex children and the other complex types also need
to be supported.
Now that you have an idea about arrays, it is better to also design for struct
and map, so that you can come up with generic recursive methods (sketched
below) and avoid rework later.
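As a rough illustration of what such "generic recursive methods" could look
like, here is a minimal sketch over a hypothetical schema node type; it is not
the actual Presto or CarbonData reader API.

import java.util.Arrays;
import java.util.List;

// Hypothetical schema node: a primitive, or an array/struct/map with children.
class ComplexType {
  enum Kind { PRIMITIVE, ARRAY, STRUCT, MAP }
  Kind kind;
  List<ComplexType> children; // element for ARRAY, fields for STRUCT, key+value for MAP

  ComplexType(Kind kind, List<ComplexType> children) {
    this.kind = kind;
    this.children = children;
  }
}

public class ComplexTypeWalker {
  // One recursive method handles any nesting depth (array of struct of map ...),
  // instead of special-casing only one-level arrays.
  static String describe(ComplexType type) {
    switch (type.kind) {
      case PRIMITIVE:
        return "primitive";
      case ARRAY:
        return "array<" + describe(type.children.get(0)) + ">";
      case STRUCT:
        StringBuilder sb = new StringBuilder("struct<");
        for (int i = 0; i < type.children.size(); i++) {
          if (i > 0) sb.append(",");
          sb.append(describe(type.children.get(i)));
        }
        return sb.append(">").toString();
      case MAP:
        return "map<" + describe(type.children.get(0)) + "," + describe(type.children.get(1)) + ">";
      default:
        throw new IllegalArgumentException("unknown kind");
    }
  }

  public static void main(String[] args) {
    ComplexType arrayOfStruct = new ComplexType(ComplexType.Kind.ARRAY,
        Arrays.asList(new ComplexType(ComplexType.Kind.STRUCT,
            Arrays.asList(
                new ComplexType(ComplexType.Kind.PRIMITIVE, null),
                new ComplexType(ComplexType.Kind.PRIMITIVE, null)))));
    System.out.println(describe(arrayOfStruct)); // array<struct<primitive,primitive>>
  }
}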

Thanks,
Ajantha

On Fri, May 22, 2020 at 1:13 PM akshay_nuthala 
wrote:

> *Background*: This feature will enable Presto to read complex columns from
> carbondata file.
> Complex columns include - array, map and struct.
>
> NOTE: This design only handles array type. Map and struct data types will
> be
> handled later.
>
> Details of solution and implementation is mentioned in the document
> attached
> in JIRA.
> https://issues.apache.org/jira/browse/CARBONDATA-3830
>
> Thanks,
> Akshay
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Presto+Carbon transactional and Non-transactional Write Support

2020-05-25 Thread Ajantha Bhat
+1 for the proposal.
I didn't see the design doc in the JIRA; please check.

Also, once we provide write support, it is better to have carbondata as a
separate plugin instead of extending the hive connector,
as presto-hive was not meant to have write support and is mainly meant
to query data where it exists.

I also hope you cover which table properties are supported for write in the
design doc.

Thanks,
Ajantha




On Thu, May 21, 2020 at 8:57 PM Akash Nilugal 
wrote:

> Hi Community,
>
> As we know, CarbonData is an indexed columnar data format for fast
> analytics on big data platforms. So
> we have already integrated with the query engines like spark and even
> presto. Currently with presto we
> only support the querying of carbon data files. But we don’t yet support
> the writing of carbon data files
> through presto engine.
>
> Currently presto is integrated with carbondata for reading the carbondata
> files via presto.
> For this, we should be having the store already ready which may be written
> carbon in spark and the table
> should be hive metastore. So using carbondata connector we are able to read
> the carbondata files. But we
> cannot create a table or load data into the table in presto. So it becomes a
> somewhat hectic job to read the
> carbon files, since they first have to be written with other engines.
>
> So here I will be trying to support the transactional load support in
> presto integration for carbon.
>
> I have attached the design document in the Jira, please refer and any
> suggestions or input is most welcome.
>
> https://issues.apache.org/jira/browse/CARBONDATA-3831
>
>
> Regards,
> Akash R.
>


[Discussion] Support pagination in SDK reader

2020-05-20 Thread Ajantha Bhat
*Background:* Pagination is the task of dividing a query result into
pages and retrieving the required pages one by one on demand
[an example is Google search, which displays results in pages]. In the database
domain, we use offset and limit to achieve it.
Now, if carbondata is used to create an image dataset for ML model training,
the user may want to display the dataset on the web (which needs pagination
support).
Example: if the table has 500 rows and the user needs the 10 rows from 400 to
410, then offset = 400 and limit = 10.
But carbondata doesn't support this now, as it only offers iterator-based reads.
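For illustration, a rough sketch of emulating offset/limit on top of an
iterator-style read; the method below is hypothetical and only shows why native
pagination support is needed: all rows before the offset still have to be read
and skipped.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PaginationSketch {
  // Emulate "offset = 400, limit = 10" over an iterative reader:
  // rows before the offset still have to be read and discarded, which is
  // exactly the inefficiency native pagination support would remove.
  static List<Object[]> readPage(Iterator<Object[]> rows, long offset, int limit) {
    List<Object[]> page = new ArrayList<>(limit);
    long position = 0;
    while (rows.hasNext() && page.size() < limit) {
      Object[] row = rows.next();
      if (position++ >= offset) {
        page.add(row);
      }
    }
    return page;
  }
}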

Details of the solution and implementation are given in the document
attached to the JIRA.

https://issues.apache.org/jira/browse/CARBONDATA-3829

Thanks,
Ajantha


Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-17 Thread Ajantha Bhat
+1

Regards,
Ajantha



On Sun, 17 May, 2020, 6:41 pm Jacky Li,  wrote:

> +1
>
> Regards,
> Jacky
>
>
> > On May 17, 2020, at 4:50 PM, Kunal Kapoor wrote:
> >
> > Hi All,
> >
> > I submit the Apache CarbonData 2.0.0(RC3) for your vote.
> >
> >
> > *1.Release Notes:*
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346046=Html=12320220
> >
> > *Some key features and improvements in this release:*
> >
> >   - Adapt to SparkSessionExtensions
> >   - Support integration with spark 2.4.5
> >   - Support heterogeneous format segments in carbondata
> >   - Support write Flink streaming data to Carbon
> >   - Insert from stage command support partition table.
> >   - Support secondary index on carbon table
> >   - Support query of stage files
> >   - Support TimeBased Cache expiration using ExpiringMap
> >   - Improve insert into performance and decrease memory foot print
> >   - Support PyTorch and TensorFlow
> >
> > *2. The tag to be voted upon* : apache-carbondata-2.0.0-rc3
> > 
> >
> > Commit: 29d78b78095ad02afde750d89a0e44f153bcc0f3
> > <
> https://github.com/apache/carbondata/commit/29d78b78095ad02afde750d89a0e44f153bcc0f3
> >
> >
> > *3. The artifacts to be voted on are located here:*
> > https://dist.apache.org/repos/dist/dev/carbondata/2.0.0-rc3/
> >
> > *4. A staged Maven repository is available for review at:*
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1062/
> >
> > *5. Release artifacts are signed with the following key:*
> > https://people.apache.org/keys/committer/kunalkapoor.asc
> >
> >
> > Please vote on releasing this package as Apache CarbonData 2.0.0,
> > The vote will be open for the next 72 hours and passes if a majority of
> at
> > least three +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache CarbonData 2.0.0
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Kunal Kapoor
>
>


Re: [Discussion] Optimize the Update Performance

2020-05-13 Thread Ajantha Bhat
Hi!
Update still uses the converter step with bad record handling.

For the update-by-dataframe scenario there is no need for bad record handling;
we can keep it only for the update-by-value case.

This can give a significant improvement, as we already observed in the insert flow.

I once tried to route it through the new insert-into flow, but the plan
rearrangement failed because of the implicit column.
I didn't continue with this because of other work.
Maybe I have to look into it again and see whether it can work.

Thanks,
Ajantha

On Thu, May 14, 2020 at 9:51 AM haomarch  wrote:

> I have several ideas to optimize the update performance:
> 1. Reduce the storage size of tupleId:
>    The tupleId is too long, leading to heavy shuffle IO overhead while joining
>    the change table with the target table.
> 2. Avoid converting String to UTF8String in the row processing.
>    Before writing rows into delta files, the conversion from String to UTF8String
>    hampers performance.
>    Code: "UTF8String.fromString(row.getString(tupleId))"
> 3. For DELETE ops in the MergeDataCommand, we shouldn't let the whole set of
>    columns of the change table take part in the JOIN op; only the "key" column is
>    needed.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 2.0.0(RC2) release

2020-05-03 Thread Ajantha Bhat
-1 to this RC,

1. I feel we need to clearly list the interface changes from the previous version
to this version in the release notes.
For example, PR #3583 changed the SDK 'Field' package name.
2. We need to list all the removed/deprecated features in the
release notes.
3. Better to recheck and update the key features section.

Thanks,
Ajantha

On Sat, May 2, 2020 at 8:53 PM xubo245 <601450...@qq.com> wrote:

> -1!
>
> Why isn't PyCarbon in the key features and improvements?
>
> PyCarbon: provide python interface for users to use CarbonData by python
> code
>
> https://issues.apache.org/jira/browse/CARBONDATA-3254
>
> Including:
> 1.PySDK: provide python interface to read and write CarbonData
> 2.Integrating deep learning framework TensorFlow
> 3.Integrating deep learning framework PyTorch
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Carbon over-use cluster resources

2020-04-15 Thread Ajantha Bhat
Hi Manhua,

For no-sort and local-sort loads only, we don't follow spark's task launch
logic; we have our own logic of one task per node. Inside that task we can
control the resources by configuration (carbon.number.of.cores.while.loading),
as in the sketch below.

As you pointed out in the above mail, *N * C is controlled by configuration*,
and the default value of C is 2.
*I see the cluster over-use problem only if it is configured badly.*
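For example, a minimal sketch of capping the per-task loading threads through
CarbonProperties; the value 2 simply mirrors the default C mentioned above and
is not a recommendation.

import org.apache.carbondata.core.util.CarbonProperties;

public class LoadingCoresConfig {
  public static void main(String[] args) {
    // Limit the number of threads each loading task may use.
    // With N nodes (one no-sort/local-sort task per node) and C = 2,
    // the cluster runs roughly N * 2 loading threads in total.
    CarbonProperties.getInstance()
        .addProperty("carbon.number.of.cores.while.loading", "2");
  }
}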

Do you have any suggestions to change the design? Feel free to raise a
discussion and work on it.

Thanks,
Ajantha

On Tue, Apr 14, 2020 at 6:06 PM Liang Chen  wrote:

> OK, thank you feedbacked this issue, let us look into it.
>
> Regards
> Liang
>
>
> Manhua Jiang wrote
> > Hi All,
> > Recently, I found carbon over-use cluster resources. Generally the design
> > of carbon work flow does not act as common spark task which only do one
> > small work in one thread, but the task has its mind/logic.
> >
> > For example,
> > 1.launch carbon with --num-executors=1 but set
> > carbon.number.of.cores.while.loading=10;
> > 2.no_sort table with multi-block input, N Iterator
> > 
> >  for example, carbon will start N tasks in parallel. And in each task the
> > CarbonFactDataHandlerColumnar has model.getNumberOfCores() (let's say C)
> > in ProducerPool, launching N*C threads in total. ==> This is the case that makes
> > me treat this as a serious problem: too many threads stall the executor from
> > sending heartbeats, and it gets killed.
> >
> > So, the over-use is related to usage of threadpool.
> >
> > This would affect the cluster overall resource usage and may lead to
> wrong
> > performance results.
> >
> > I hope this get your notice while fixing or writing new codes.
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [VOTE] Apache CarbonData 2.0.0(RC1) release

2020-04-02 Thread Ajantha Bhat
Hi,
For rc1, my comment is: -1.

Similar points as Liang, but along with that: after #3661, many documentation
links for the MV, bloom and lucene datamaps are broken in README.md.
We need to fix this before the carbondata 2.0.0 release.

Thanks,
Ajantha

On Thu, Apr 2, 2020 at 4:26 PM Liang Chen  wrote:

> Hi
>
> Thanks for preparing 2.0.0.
> For rc1, my comment is : -1 (binding)
> The following of open issues should be considerred in 2.0.0:
>
> https://github.com/apache/carbondata/pull/3675
> https://github.com/apache/carbondata/pull/3687
> https://github.com/apache/carbondata/pull/3682
> https://github.com/apache/carbondata/pull/3691
> https://github.com/apache/carbondata/pull/3689
> https://github.com/apache/carbondata/pull/3686
> https://github.com/apache/carbondata/pull/3683
> https://github.com/apache/carbondata/pull/3676
> https://github.com/apache/carbondata/pull/3690
> https://github.com/apache/carbondata/pull/3688
> https://github.com/apache/carbondata/pull/3639
> https://github.com/apache/carbondata/pull/3659
> https://github.com/apache/carbondata/pull/3669
>
> Regards
> Liang
>
> kunalkapoor wrote
> > Hi All,
> >
> > I submit the Apache CarbonData 2.0.0(RC1) for your vote.
> >
> >
> > *1.Release Notes:*
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12346046
> >
> > *Some key features and improvements in this release:*
> >
> >- Adapt to SparkSessionExtensions
> >- Support integration with spark 2.4.5
> >- Support heterogeneous format segments in carbondata
> >- Support write Flink streaming data to Carbon
> >- Insert from stage command support partition table.
> >- Support secondary index on carbon table
> >- Support query of stage files
> >- Support TimeBased Cache expiration using ExpiringMap
> >- Improve insert into performance and decrease memory foot print
> >
> >  *2. The tag to be voted upon* : apache-carbondata-2.0.0-rc1
> > 
> https://github.com/apache/carbondata/tree/apache-carbondata-2.0.0-rc1;
> >
> > Commit: a906785f73f297b4a71c8aaeabae82ae690fb1c3
> > 
> https://github.com/apache/carbondata/commit/a906785f73f297b4a71c8aaeabae82ae690fb1c3
> ;
> > )
> >
> > *3. The artifacts to be voted on are located here:*
> > https://dist.apache.org/repos/dist/dev/carbondata/2.0.0-rc1/
> >
> > *4. A staged Maven repository is available for review at:*
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1060/
> >
> > *5. Release artifacts are signed with the following key:*
> > https://people.apache.org/keys/committer/kunalkapoor.asc
> >
> >
> > Please vote on releasing this package as Apache CarbonData 2.0.0,  The
> > vote will
> > be open for the next 72 hours and passes if a majority of at least three
> > +1
> > PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache CarbonData 2.0.0
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Kunal Kapoor
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Disable Adaptive encoding for Double and Float by default

2020-03-26 Thread Ajantha Bhat
Hi,

*I was able to decrease the memory usage in TLAB from 68 GB to 29.94 GB for
the same TPCH data* *without disabling adaptive encoding*.

*There is also about a 5% improvement in insert*. Please check the PR.

https://github.com/apache/carbondata/pull/3682

Before the change:
[image: Screenshot from 2020-03-26 16-45-12]
<https://user-images.githubusercontent.com/5889404/77640947-380c0e80-6f81-11ea-97ff-f1b8942d99c6.png>

After the change:

[image: Screenshot from 2020-03-26 16-51-31]
<https://user-images.githubusercontent.com/5889404/77641533-34c55280-6f82-11ea-8a60-bfb6c8d8f52a.png>

Thanks,

Ajantha


On Wed, Mar 25, 2020 at 2:51 PM Ravindra Pesala 
wrote:

> Hi Anantha,
>
> I think it is better to fix the problem instead of disabling the things. It
> is already observed that store size increases proportionally. If my data
> has more columns then it will be exponential.  Store size directly impacts
> the query performance in object store world. It is better to find a way to
> fix it rather than removing things.
>
> Regards,
> Ravindra.
>
> On Wed, 25 Mar 2020 at 5:04 PM, Ajantha Bhat 
> wrote:
>
> > Hi Ravi, please find the performance readings below.
> >
> > On TPCH 10GB data, carbon to carbon insert in on HDFS standalone cluster:
> >
> >
> > *By disabling adaptive encoding for float and double.*
> > insert is *more than 10% faster* [before 139 seconds, after this it is
> > 114 seconds] and
> > *saves 25% memory in TLAB*store size *has increased by 10% *[before 2.3
> > GB, after this it is 2.55 GB]
> >
> > Also we have below check. If data is more than 5 decimal precision. we
> > don't apply adaptive encoding for double/float.
> > So, I am not sure how much it is useful for real-world double precision
> > data.
> >
> > [image: Screenshot from 2020-03-25 14-27-07.png]
> >
> >
> > *Bottleneck is finding that decimal points from every float and double
> > value [*PrimitivePageStatsCollector.getDecimalCount(double)*] *
> > *where we convert to string and use substring().*
> >
> > so I want to disable adaptive encoding for double and float by default.
> >
> > Thanks,
> > Ajantha
> >
> > On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala 
> > wrote:
> >
> >> Hi ,
> >>
> >> It increases the store size.  Can you give me performance figures with
> and
> >> without these changes.  And also provide how much store size impact if
> we
> >> disable it.
> >>
> >>
> >> Regards,
> >> Ravindra.
> >>
> >> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat 
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I have done insert into flow profiling using JMC with the latest code
> >> > [with new optimized insert flow]
> >> >
> >> > It seems for *2.5GB* carbon to carbon insert, double and float stats
> >> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> >> > buffer)]
> >> >
> >> > [image: Screenshot from 2020-03-25 11-18-04.png]
> >> > *The problem is for every value of double and float in every row, we
> >> call *
> >> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new
> objects
> >> > every time.*
> >> >
> >> > So, I want to disable Adaptive encoding for float and double by
> default.
> >> > *I will make this configurable.*
> >
> >
> >> > If some user has a well-sorted double or float column and wants to
> apply
> >> > adaptive encoding on that, they can enable it to reduce store size.
> >> >
> >> > Thanks,
> >> > Ajantha
> >> >
> >> --
> >> Thanks & Regards,
> >> Ravi
> >>
> > --
> Thanks & Regards,
> Ravi
>


Re: Disable Adaptive encoding for Double and Float by default

2020-03-25 Thread Ajantha Bhat
Hi Ravi, please find the performance readings below.

For TPCH 10 GB data, carbon-to-carbon insert on an HDFS standalone cluster:

*By disabling adaptive encoding for float and double:*
- insert is *more than 10% faster* [139 seconds before, 114 seconds after],
  and it *saves 25% of the memory in TLAB*
- store size *has increased by 10%* [2.3 GB before, 2.55 GB after]

Also, we have the check below: if the data has more than 5 digits of decimal
precision, we don't apply adaptive encoding for double/float.
So, I am not sure how useful it is for real-world double-precision data.

[image: Screenshot from 2020-03-25 14-27-07.png]

*The bottleneck is finding the decimal count of every float and double
value [*PrimitivePageStatsCollector.getDecimalCount(double)*],
where we convert to string and use substring()* (illustrated below).
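To make the allocation cost concrete, a simplified stand-in for the per-value
work described above (not the exact CarbonData implementation): every call
builds a new String just to count the digits after the decimal point.

public class DecimalCountSketch {
  // Simplified stand-in for the per-value decimal-count check:
  // String.valueOf allocates a new String (and its backing array) for every
  // double in every row, which is what shows up as TLAB pressure in JMC.
  static int decimalCountViaString(double value) {
    String s = String.valueOf(value);
    int dot = s.indexOf('.');
    return dot < 0 ? 0 : s.length() - dot - 1;
  }

  public static void main(String[] args) {
    System.out.println(decimalCountViaString(3.14159)); // 5
    System.out.println(decimalCountViaString(42.0));    // 1, since "42.0" has one digit after the dot
  }
}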

So I want to disable adaptive encoding for double and float by default.

Thanks,
Ajantha

On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala 
wrote:

> Hi ,
>
> It increases the store size.  Can you give me performance figures with and
> without these changes.  And also provide how much store size impact if we
> disable it.
>
>
> Regards,
> Ravindra.
>
> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat 
> wrote:
>
> > Hi all,
> >
> > I have done insert into flow profiling using JMC with the latest code
> > [with new optimized insert flow]
> >
> > It seems for *2.5GB* carbon to carbon insert, double and float stats
> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> > buffer)]
> >
> > [image: Screenshot from 2020-03-25 11-18-04.png]
> > *The problem is for every value of double and float in every row, we
> call *
> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
> > every time.*
> >
> > So, I want to disable Adaptive encoding for float and double by default.
> > *I will make this configurable.*
> > If some user has a well-sorted double or float column and wants to apply
> > adaptive encoding on that, they can enable it to reduce store size.
> >
> > Thanks,
> > Ajantha
> >
> --
> Thanks & Regards,
> Ravi
>


Disable Adaptive encoding for Double and Float by default

2020-03-24 Thread Ajantha Bhat
Hi all,

I have profiled the insert-into flow using JMC with the latest code [with the
new optimized insert flow].

It seems that for a *2.5 GB* carbon-to-carbon insert, the double and float stats
collector used *68.36 GB* [*25%* of the TLAB (thread-local allocation
buffer)].

[image: Screenshot from 2020-03-25 11-18-04.png]
*The problem is that for every double and float value in every row, we call
PrimitivePageStatsCollector.getDecimalCount(), which makes new objects
every time.*

So, I want to disable Adaptive encoding for float and double by default.
*I will make this configurable.*
If some user has a well-sorted double or float column and wants to apply
adaptive encoding on that, they can enable it to reduce store size.

Thanks,
Ajantha


Re: Propose to upgrade hive version to 3.1.0

2020-02-21 Thread Ajantha Bhat
+1,

Will the current version still be supported, or will carbondata only support
3.1.0 after this?

Thanks,
Ajantha

On Fri, 21 Feb, 2020, 4:39 pm Kunal Kapoor, 
wrote:

> Hi All,
>
> The hive community has already released version 3.1.0 which has a lot of
> bug fixes and new features.
> Many of the users have already migrated to 3.1.0 in their production
> environment, and I think it's time we also upgrade the hive-carbon
> integration to this version.
>
> Please go through the release notes
> <
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343014=Text=12310843
> >
> for the list of improvements and bug fixes.
>
> Regards
> Kunal Kapoor
>


Re: Discussion: change default compressor to ZSTD

2020-02-19 Thread Ajantha Bhat
Hi Jacky and Ravindra,

We have tested ZSTD vs snappy again with the latest code on a 3-node spark
2.3 cluster on HDFS with TPCH 500 GB data.
Below is the summary:

*1. The ZSTD store is 28.8% smaller compared to snappy.*
*2. Overall query time degrades by 18.35% with ZSTD compared to snappy.*
*3. Load time with ZSTD shows a negligible degradation of 0.7% compared to
snappy.*

Based on this, I guess we cannot use ZSTD as the default due to the large
degradation in query time.

Thanks,
Ajantha




On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala 
wrote:

> Hi Jacky,
>
> As per the original PR
> https://github.com/apache/carbondata/pull/2628 , query performance got
> decreased by 20% ~ 50% compared to snappy.  So I am concerned about the
> performance. Please better have a proper tpch performance report on the
> regular cluster like we do for every version and decide based on that.
>
> Regards,
> Ravindra.
>
> On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li  wrote:
>
> > Hi Ajantha,
> >
> >
> > Yes, decoder will use the compressorName stored in ChunkCompressionMeta
> > from the file header,
> > but I think it is better to put it in the name so that user can know the
> > compressor in the shell without reading it by launching engine.
> >
> >
> > In spark, for parquet/orc the file name written
> > is:part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
> >
> >
> > In PR3606, I will handle the compatibility.
> >
> >
> > Regards,
> > Jacky
> >
> >
> > ------ Original Message ------
> > From: "Ajantha Bhat"    Sent: Thursday, Feb 6, 2020, 11:51 PM
> > To: "dev"
> > Subject: Re: Discussion: change default compressor to ZSTD
> >
> >
> >
> > Hi,
> >
> > 33% is huge a reduction in store size. If there is negligible difference
> in
> > load and query time, we should definitely go for it.
> >
> > And does user really need to know about what compression is used ? change
> > in file name may be need to handle compatibility.
> > Already thrift *FileHeader, ChunkCompressionMeta* is storing the
> compressor
> > name. query time decoding can be based on this.
> >
> > Thanks,
> > Ajantha
> >
> >
> > On Thu, Feb 6, 2020 at 4:27 PM Jacky Li  >
> >  Hi,
> > 
> > 
> >  I compared snappy and zstd compressor using TPCH for carbondata.
> > 
> > 
> >  For TPCH lineitem table:
> >                carbon-zstd    carbon-snappy
> >  loading (s)   53             51
> >  size          795MB          1.2GB
> > 
> >  TPCH-query (carbon-zstd / carbon-snappy):
> >  Q1     4.289      8.29
> >  Q2     12.609     12.986
> >  Q3     14.902     14.458
> >  Q4     6.276      5.954
> >  Q5     23.147     21.946
> >  Q6     1.12       0.945
> >  Q7     23.017     28.007
> >  Q8     14.554     15.077
> >  Q9     28.472     27.473
> >  Q10    24.067     24.682
> >  Q11    3.321      3.79
> >  Q12    5.311      5.185
> >  Q13    14.08      11.84
> >  Q14    2.262      2.087
> >  Q15    5.496      4.772
> >  Q16    29.919     29.833
> >  Q17    7.018      7.057
> >  Q18    17.367     17.795
> >  Q19    2.931      2.865
> >  Q20    11.347     10.937
> >  Q21    26.416     28.414
> >  Q22    5.923      6.311
> >  sum    283.844    290.704
> > 
> > 
> >  As you can see, after using zstd, table size is 33% reduced
> comparing
> > to
> >  snappy. And the data loading and query time difference is
> negligible.
> > So I
> >  suggest to change the default compressor in carbondata from snappy
> to
> > zstd.
> > 
> > 
> >  To change the default compressor, we need to:
> >  1. append the compressor name in the carbondata file name. So that
> > from
> >  the file name user can know what compressor is used.
> >  For example, file name will be changed from
> >  part-0-0_batchno0-0-0-1580982686749.carbondata
> >  to  part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> >  or  part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> > 
> > 
> >  2. Change the compressor constant in CarbonCommonConstaint.java file
> > to
> >  use zstd as default compressor
> > 
> > 
> >  What do you think?
> > 
> > 
> >  Regards,
> >  Jacky
>
> --
> Thanks & Regards,
> Ravi
>


Re: Improving show segment info

2020-02-16 Thread Ajantha Bhat
3. And about event time: I don't think we need to keep it for every row; it
is a waste of storage size. Can we keep it in LoadMetadataDetails or at the
file level?

On Mon, Feb 17, 2020 at 11:10 AM Ajantha Bhat  wrote:

> Hi Likun,
>
> I think this display command is hard to maintain if we provide all these
> options manually.
>
> *1. How about creating a "tableName.segmentInfo" child table for each main
> table?* user can query this table and easy to support filter, group by.
> we just have to finalize the schema of this table.
>
> 2. For each partition to find out which all the segments it is mapped to,
> currently we don't store this information anywhere. so, where are you
> planning to store it? I don't think we need to calculate it every time.
>
> Thanks,
> Ajantha
>
>
> On Sun, Feb 16, 2020 at 10:48 PM akashrn5  wrote:
>
>> Hi,
>>
>> >I got your point, but given the partition column by user does not help
>> reducing the information. If we want to reduce the >amount of the
>> information, we should ask user to give the filter on partition column
>> like
>> example 3 in my original mail.
>>
>> 1. my concern was if there are more partition column and instead of
>> partition value filter, i was thinking of having filter on partition
>> column
>>
>>
>> >Do you mean skip the partition columns in the SHOW SEGMENTS result? For
>> example 3.
>> Same as above point 1.
>>
>> >No, by showing example 1, actually I want to change the default output of
>> the SHOW SEGMENTS to those 6 fields only >in example 1.
>> >I suggest having both `spent` and `throughput` so that user does not need
>> to calculate himself.
>>
>> I agree, then what about the old default infos which we have now, like
>> data
>> size, index size, mergedTo,format? All these infos will be moved to *DESC
>> SEGMENT 2 ON table1*, if i am not wrong?
>>
>> Regards,
>> Akash R Nilugal
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>>
>


Re: Improving show segment info

2020-02-16 Thread Ajantha Bhat
Hi Likun,

I think this display command is hard to maintain if we provide all these
options manually.

*1. How about creating a "tableName.segmentInfo" child table for each main
table?* Users can query this table, and it is easy to support filter and
group by; we just have to finalize the schema of this table.

2. For each partition, to find out which segments it is mapped to: currently we
don't store this information anywhere, so where are you planning to store it?
I don't think we need to calculate it every time.

Thanks,
Ajantha


On Sun, Feb 16, 2020 at 10:48 PM akashrn5  wrote:

> Hi,
>
> >I got your point, but given the partition column by user does not help
> reducing the information. If we want to reduce the >amount of the
> information, we should ask user to give the filter on partition column like
> example 3 in my original mail.
>
> 1. my concern was if there are more partition column and instead of
> partition value filter, i was thinking of having filter on partition column
>
>
> >Do you mean skip the partition columns in the SHOW SEGMENTS result? For
> example 3.
> Same as above point 1.
>
> >No, by showing example 1, actually I want to change the default output of
> the SHOW SEGMENTS to those 6 fields only >in example 1.
> >I suggest having both `spent` and `throughput` so that user does not need
> to calculate himself.
>
> I agree, then what about the old default infos which we have now, like data
> size, index size, mergedTo,format? All these infos will be moved to *DESC
> SEGMENT 2 ON table1*, if i am not wrong?
>
> Regards,
> Akash R Nilugal
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Regarding presto carbondata integration

2020-02-11 Thread Ajantha Bhat
Hi all,

Currently the master code of carbondata works with *prestodb 0.217*.
We all know about the competing *presto-sql* as well.
Some users don't want to migrate to *presto-sql* as their cloud
vendor doesn't support it (for example, AWS EMR, Huawei MRS, and Azure
services except HDInsight still come with *prestodb*).

So:
1. Does carbondata need to support both of them?
2. Does carbondata need to maintain two modules, one for prestodb and one for
prestosql? We may need to extract the common code (a big effort).
3. At a time, carbondata can support only one version of prestodb and
presto-sql. They release a version every 15 days, and our integration is not
based on SPI (not a standalone connector); we extended the hive connector
interface. So every few releases the carbondata-presto integration code
needs to be modified. This can be a bigger maintenance problem.

And this is only about read support; when we handle write support, we need to
take care of all the above points.

Thanks,
Ajantha


Re: Discussion: change default compressor to ZSTD

2020-02-06 Thread Ajantha Bhat
Hi,

A 33% reduction in store size is huge. If there is a negligible difference in
load and query time, we should definitely go for it.

And does the user really need to know which compression is used? A change
in the file name may need compatibility handling.
The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
name; query-time decoding can be based on it.

Thanks,
Ajantha


On Thu, Feb 6, 2020 at 4:27 PM Jacky Li  wrote:

> Hi,
>
>
> I compared snappy and zstd compressor using TPCH for carbondata.
>
>
> For TPCH lineitem table:
>               carbon-zstd    carbon-snappy
> loading (s)   53             51
> size          795MB          1.2GB
>
> TPCH-query (carbon-zstd / carbon-snappy):
> Q1     4.289      8.29
> Q2     12.609     12.986
> Q3     14.902     14.458
> Q4     6.276      5.954
> Q5     23.147     21.946
> Q6     1.12       0.945
> Q7     23.017     28.007
> Q8     14.554     15.077
> Q9     28.472     27.473
> Q10    24.067     24.682
> Q11    3.321      3.79
> Q12    5.311      5.185
> Q13    14.08      11.84
> Q14    2.262      2.087
> Q15    5.496      4.772
> Q16    29.919     29.833
> Q17    7.018      7.057
> Q18    17.367     17.795
> Q19    2.931      2.865
> Q20    11.347     10.937
> Q21    26.416     28.414
> Q22    5.923      6.311
> sum    283.844    290.704
>
>
> As you can see, after using zstd the table size is reduced by 33% compared to
> snappy, and the data loading and query time difference is negligible. So I
> suggest changing the default compressor in carbondata from snappy to zstd.
>
>
> To change the default compressor, we need to:
> 1. append the compressor name in the carbondata file name. So that from
> the file name user can know what compressor is used.
> For example, file name will be changed from
> part-0-0_batchno0-0-0-1580982686749.carbondata
> topart-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> orpart-0-0_batchno0-0-0-1580982686749.zstd.carbondata
>
>
> 2. Change the compressor constant in the CarbonCommonConstants.java file to
> use zstd as the default compressor
>
>
> What do you think?
>
>
> Regards,
> Jacky


Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

2020-01-14 Thread Ajantha Bhat
+1,

Can you explain more about how you are encoding and storing min/max in the
segment file? As min/max values represent user data, we cannot store them as
plain values, and storing encrypted min/max will add the overhead of encrypting
and decrypting.
I suggest we convert the segment file to a thrift file to solve this. Other
suggestions are welcome.

Thanks,
Ajantha

On Tue, 14 Jan, 2020, 4:37 pm Indhumathi,  wrote:

> Hello all,
>
> In Cloud scenarios, index is too big to store in SparkDriver, since VM may
> not have so much memory.
> Currently in Carbon, we will load all indexes to cache for first time
> query.
> Since Carbon LRU Cache does
> not support time-based expiration, indexes will be removed from cache based
> on LeastRecentlyUsed mechanism,
> when the carbon lru cache is full.
>
> In some scenarios, where user's table has more segments and if user queries
> only very few segments often, we no
> need to load all indexes to cache. For filter queries, if we prune and load
> only matched segments to cache,
> then driver's memory will be saved.
>
> For this purpose, I am planing to add block minmax to segment metadata file
> and prune segment based on segment files and
> load index only for matched segment. As part of this, will add a
> configurable carbon property '*carbon.load.all.index.to.cache*'
> to allow user to load all indexes to cache if needed. BY default, value
> will
> be true.
>
> Currently, for each load, we will write a segment metadata file, while
> holds
> the information about indexFile.
> During query, we will read each segmentFile for getting indexFileInfo and
> then we will load all datamaps for the segment.
> MinMax data will be encoded and stored in segment file.
>
> Any suggestions/inputs from the community is appreciated.
>
> Thanks
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Optimize and refactor insert into command

2020-01-01 Thread Ajantha Bhat
Hi sujith,

I still keep the converter step for some scenarios. For example, for insert
from parquet to carbon we need an optimized converter to convert the timestamp
long value (divide by 1000) and to convert null values of direct-dictionary
columns to 1 (a rough sketch follows).
So, for the scenarios you mentioned, I will be using this flow with the
optimized converter.

For carbon to carbon insert with same source and destination
properties(this is common scenario in cloud migration) , it goes to no
converter step and use direct spark internal row till write step.
Compaction also can use this no converter step.

Thanks,
Ajantha

On Thu, 2 Jan, 2020, 12:18 am sujith chacko, 
wrote:

> Hi Ajantha,
>
>Thanks for your initiative, I have couple of questions even though.
>
> a) As per your explanation the dataset validation is already done as part
> of the source table, this is what you mean? What I understand is the insert
> select queries are going to get some benefits since we don’t do some
> additional steps.
>
> What about if your destination table has some different table properties
> like few columns may have non null properties or date format or decimal
> precision’s or scale may be different.
> So you may need a bad record support then  , how you are going to handle
> such scenarios? Correct me if I misinterpreted your points.
>
> Regards,
> Sujith
>
>
> On Fri, 20 Dec 2019 at 5:25 AM, Ajantha Bhat 
> wrote:
>
> > Currently carbondata "insert into" uses the CarbonLoadDataCommand itself.
> > Load process has steps like parsing and converter step with bad record
> > support.
> > Insert into doesn't require these steps as data is already validated and
> > converted from source table or dataframe.
> >
> > Some identified changes are below.
> >
> > 1. Need to refactor and separate load and insert at driver side to skip
> > converter step and unify flow for No sort and global sort insert.
> > 2. Need to avoid reorder of each row. By changing select dataframe's
> > projection order itself during the insert into.
> > 3. For carbon to carbon insert, need to provide the ReadSupport and use
> > RecordReader (vector reader currently doesn't support ReadSupport) to
> > handle null values, time stamp cutoff (direct dictionary) from scanRDD
> > result.
> > 4. Need to handle insert into partition/non-partition table in local
> sort,
> > global sort, no sort, range columns, compaction flow.
> >
> > The final goal is to improve insert performance by keeping only required
> > logic and also decrease the memory footprint.
> >
> > If you have any other suggestions or optimizations related to this let me
> > know.
> >
> > Thanks,
> > Ajantha
> >
>


Re: Apply to open 'Issues' tab in Apache CarbonData github

2019-12-23 Thread Ajantha Bhat
If the plan for the Issues tab is just to work around mailing-list problems, I
would suggest we start using *Slack* instead.

Many companies and open-source communities use Slack (I have used it in the
Presto SQL community). It supports thread-based conversations, searching is
easy, it provides the option to create multiple channels, and it works in China
without any VPN.

Please have a look at it once.

Thanks,
Ajantha

On Mon, 23 Dec, 2019, 11:31 pm 恩爸, <441586...@qq.com> wrote:

> Hi Liang:
>  Carbondata users can raise issues in GitHub Issues to ask any questions;
> the function is the same as the mailing list. Many Chinese users can't
> access the mailing list and are used to using GitHub Issues.
>  Real issues should still be tracked and recorded in Apache JIRA.
>
>
> --Original--
> From:"Liang Chen [via Apache CarbonData Dev Mailing List archive]"<
> ml+s1130556n88594...@n5.nabble.com;
> Date:Sun, Dec 22, 2019 09:14 PM
> To:"恩爸"<441586...@qq.com;
>
> Subject:Re: Apply to open 'Issues' tab in Apache CarbonData github
>
>
>
> Hi
>
> +1 from my side.
> One question: which issues should be raised in Apache JIRA, and which issues
> should be raised in GitHub Issues?
> It is better to give a clear definition.
>
> Regards
> Liang
>
>
> xm_zzc wrote:
>  Hi community:
>   I suggest the community open the 'Issues' tab on the carbondata GitHub
>  page. We can use this feature to collect information about carbondata
>  users, like this:
>  https://github.com/apache/incubator-shardingsphere/issues/234
>  Users can willingly add the information of the company that uses
>  carbondata, and we can add this info to the CarbonData website.
>   What do you think about this?
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
>
>


Optimize and refactor insert into command

2019-12-19 Thread Ajantha Bhat
Currently carbondata "insert into" uses the CarbonLoadDataCommand itself.
Load process has steps like parsing and converter step with bad record
support.
Insert into doesn't require these steps as data is already validated and
converted from source table or dataframe.

Some identified changes are below.

1. Need to refactor and separate load and insert at driver side to skip
converter step and unify flow for No sort and global sort insert.
2. Need to avoid reorder of each row. By changing select dataframe's
projection order itself during the insert into.
3. For carbon to carbon insert, need to provide the ReadSupport and use
RecordReader (vector reader currently doesn't support ReadSupport) to
handle null values, time stamp cutoff (direct dictionary) from scanRDD
result.
4. Need to handle insert into partition/non-partition table in local sort,
global sort, no sort, range columns, compaction flow.

The final goal is to improve insert performance by keeping only required
logic and also decrease the memory footprint.
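
To make the target flow concrete, a small hedged illustration (table names are
hypothetical and 'carbon' is a CarbonSession): a plain insert-select between two
carbon tables whose projection order already matches the target schema, which is
exactly the case where the converter and reorder steps add no value.

carbon.sql("CREATE TABLE target_carbon (id INT, name STRING, amount DOUBLE) STORED AS carbondata")

// projection order matches the target schema, so no per-row reorder,
// parsing or bad-record handling is actually needed
carbon.sql(
  """
    |INSERT INTO target_carbon
    |SELECT id, name, amount FROM source_carbon
  """.stripMargin)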

If you have any other suggestions or optimizations related to this let me
know.

Thanks,
Ajantha


Re: [DISCUSSION]Support for Geospatial indexing

2019-11-27 Thread Ajantha Bhat
Hi Venu,

1. Please keep the default implementation independent of grid size and
other parameters.
I mean below parameters.
'INDEX_HANDLER.xxx.gridSize',
'INDEX_HANDLER.xxx.minLongitude',
'INDEX_HANDLER.xxx.maxLongitude',
'INDEX_HANDLER.xxx.minLatitude',
'INDEX_HANDLER.xxx.maxLatitude',

The default implementation should work with just longitude and latitude, with a
default index type and float as the default data type for longitude and
latitude. The quadtree logic can be generic: it takes a geohash id and returns
ranges, and can work for all implementations.

A custom implementation can add grid size and the other parameters if required
(a purely illustrative DDL sketch follows after point 2 below).

2. In the DESCRIBE FORMATTED output, instead of 'non-schema columns', we can
show it as Custom Index Information. It is also better to show the custom index
handler name and the source columns used, for example:

# Custom Index Information
Custom Index Handler Class     :
Custom Index Handler Type      :
Custom Index Column Name       :
Custom Index Column Data Type  :
Custom Index Source Columns    :

We can skip the display entirely if the property is not configured.
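
Regarding point 1, a purely illustrative sketch of what the minimal table
definition could look like. The property names follow the INDEX_HANDLER naming
quoted above, 'carbon' is assumed to be a CarbonSession, and no grid size or
min/max bounds are given, matching the default behaviour requested here; the
actual DDL is defined in the design document and may differ.

carbon.sql(
  """
    |CREATE TABLE gps_events (
    |  event_time TIMESTAMP,
    |  longitude FLOAT,
    |  latitude FLOAT
    |) STORED AS carbondata
    |TBLPROPERTIES (
    |  'INDEX_HANDLER'='mygeohash',
    |  'INDEX_HANDLER.mygeohash.type'='geohash',
    |  'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude'
    |)
  """.stripMargin)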

Thanks,
Ajantha



On Tue, Nov 26, 2019 at 8:38 PM VenuReddy  wrote:

> Hi all,
>
> I've refreshed the design document in jira. Have incorporated changes to
> table properties and fixed review comments.
> Please find  the latest design doc at
> https://issues.apache.org/jira/browse/CARBONDATA-3548
> Request review and let me know your opinion.
>
> Thanks,
> Venu Reddy
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

2019-11-24 Thread Ajantha Bhat
+1 ,

As we have already worked on it, we have to integrate it as cleanly as
possible.

I think this can be done by 2 layers.

1. *PySDK:* a generic python layer over the java SDK. Users who don't need AI
support but just a python SDK layer can use this alone.
    a. It supports reading and writing carbondata files (like the java SDK). We
can have a document listing all the APIs we support.
    b. It also supports building the Arrow carbon reader provided by the java
SDK: here we read carbon files and fill the in-memory arrow vector. This is
used by the PyCarbon layer.

2. *PyCarbon:* this layer will be responsible for integrating carbondata with
AI engines like TensorFlow, MXNet and PyTorch to provide AI scenarios such as
epoch and shuffle.
Uber's petastorm (an open-source, Apache-licensed project) supports all of the
above scenarios using the arrow format, and carbondata can already write to an
arrow vector from the SDK (#3193), so integration is easy: we just have to add
a petastorm dependency to the carbondata project.
*I suggest we take the latest version of petastorm now [v0.77].*
*We can have a design document describing how this is done and which interfaces
we support from pycarbon.*

Thanks,
Ajantha


On Sun, Nov 24, 2019 at 11:17 AM Jacky Li  wrote:

> +1
>
> Great proposal, thanks for contributing
>
> Regards,
> Jacky
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION]Support for Geospatial indexing

2019-10-24 Thread Ajantha Bhat
Hi Jacky,

We have checked GeoMesa; a summary of our findings is below.


a. GeoMesa is tightly coupled with key-value databases like Accumulo, HBase,
Google Bigtable and Cassandra, and is used for OLTP queries.
b. GeoMesa's current Spark integration is only in the query flow; loading from
Spark is not supported, and Spark is used only for analytics on a GeoMesa store.
They override Spark Catalyst optimizer code to intercept the filter from the
logical relation and push it down to the GeoMesa server.
Query logic such as space-filling-curve building (z-curve, quadtree) doesn't
happen at the Spark layer; it happens in the GeoServer layer, which is coupled
with the key-value databases.
https://www.geomesa.org/documentation/user/architecture.html

https://www.geomesa.org/documentation/user/spark/architecture.html

https://www.youtube.com/watch?v=Otf2jwdNaUY

c. GeoMesa is for spatio-temporal data, not just spatial data.
So we cannot integrate carbon with GeoMesa directly, but we can reuse some of
the logic present in it, such as quadtree formation and lookup.

I also found *another alternative*, *GeoSpark*; this project is not coupled
with any store.
https://datasystemslab.github.io/GeoSpark/

https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
So we will check further into integrating carbon with GeoSpark or reusing some
of its code.

Also, regarding the second point: yes, we can make the carbon implementation a
generic framework where we can plug in different logic.

Thanks,
Ajantha





On Mon, Oct 21, 2019 at 6:34 PM Indhumathi  wrote:

> Hi Venu,
>
> I have some questions regarding this feature.
>
> 1. Does geospatial index supports on streaming table?. If so, will there be
> any impact on generating
> geoIndex on steaming data?
> 2. Does it have any restrictions on sort_scope?
> 3. Apart from Point and Polygon queries, will geospatial index also support
> Aggregation queries on
> geographical location data?
>
> Thanks & Regards,
> Indhumathi
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Ajantha as new Apache CarbonData committer

2019-10-03 Thread Ajantha Bhat
Thank you all.

On Thu, 3 Oct, 2019, 6:47 PM Kunal Kapoor,  wrote:

> Congratulations ajantha
>
> On Thu, Oct 3, 2019, 5:30 PM Liang Chen  wrote:
>
> > Hi
> >
> >
> > We are pleased to announce that the PMC has invited Ajantha as new Apache
> > CarbonData committer and the invite has been accepted!
> >
> > Congrats to Ajantha and welcome aboard.
> >
> > Regards
> >
> > Apache CarbonData PMC
> >
>


Re: [DISCUSSION] Support Time Series for MV datamap and autodatamap loading of timeseries datamaps

2019-09-30 Thread Ajantha Bhat
+ 1 ,

I have some suggestions and questions.

1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
'timeseries_column', so that it won't give the impression that only the
timestamp datatype is supported; also, please update the document with all the
supported datatypes (a purely illustrative sketch follows after question 3).

2. Querying the datamap table directly is also supported, right? Is rewriting
the main-table plan to refer to the datamap table meant to spare the user from
changing the query, or is there another reason?

3. If the user has not created a day-granularity datamap but only an
hour-granularity datamap, and a query asks for day granularity, will data be
fetched from the hour-granularity datamap and aggregated, or fetched from the
main table?
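
For question 1, a purely illustrative sketch of the DDL implied by that naming.
It assumes the MV datamap provider discussed in this thread, a hypothetical
'sales' table, and a hypothetical 'granularity' property; the real property
names and syntax come from the design document and may differ.

carbon.sql(
  """
    |CREATE DATAMAP sales_hourly ON TABLE sales
    |USING 'mv'
    |DMPROPERTIES ('timeseries_column'='order_time', 'granularity'='hour')
    |AS SELECT timeseries(order_time, 'hour') AS time_bucket, sum(amount)
    |   FROM sales GROUP BY timeseries(order_time, 'hour')
  """.stripMargin)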

Thanks,
Ajantha

On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal 
wrote:

> Hi xuchuanyin,
>
> Thanks for the comments/Suggestions
>
> 1. Preaggregate is productized, but not the timeseries with preaggregate,
> i think you  got confused with that, if im right.
> 2. Limitations like, auto sampling or rollup, which we will be supporting
> now. Retention policies. etc
> 3. segmentTimestampMin, this i will consider in design.
> 4. RP is added as a separate task, i thought instead of maintaining two
> variables better to maintabin one and parse it. But i will consider your
> point based on feasibility during implementation.
> 5. We use an accumulator which takes list, so before writing index files
> we take the min max of the timestamp column and fill in accumulator and
> then we can access accumulator.value in driver after load is finished.
>
> Regards,
> Akash R Nilugal
>
> On 2019/09/28 10:46:31, xuchuanyin  wrote:
> > Hi akash, glad to see the feature proposed and I have some questions
> about
> > this. Please notice that some of the following descriptions are comments
> > followed by '===' described in the design document attached in the
> > corresponding jira.
> >
> > 1.
> > "Currently carbondata supports timeseries on preaggregate datamap, but
> its
> > an alpha feature"
> > ===
> > It has been some time since the preaggregate datamap was introduced and
> it
> > is still **alpha**, why it is still not product-ready? Will the new
> feature
> > also come into the similar situation?
> >
> > 2.
> > "there are so many limitations when we compare and analyze the existing
> > timeseries database or projects which supports time series like apache
> druid
> > or influxdb"
> > ===
> > What are the actual limitations? Besides, please give an example of this.
> >
> > 3.
> > "Segment_Timestamp_Min"
> > ===
> > Suggest using camel-case style like 'segmentTimestampMin'
> >
> > 4.
> > "RP is way of telling the system, for how long the data should be kept"
> > ===
> > Since the function is simple, I'd suggest using 'retentionTime'=15 and
> > 'timeUnit'='day' instead of 'RP'='15_days'
> >
> > 5.
> > "When the data load is called for main table, use an spark accumulator to
> > get the maximum value of timestamp in that load and return to the load."
> > ===
> > How can you get the spark accumulator? The load is launched using
> > loading-by-dataframe not using global-sort-by-spark.
> >
> > 6.
> > For the rest of the content, still reading.
> >
> >
> >
> >
> > --
> > Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: Question about Presto integration

2019-07-15 Thread Ajantha Bhat
Hi Yuya Ebihara,
As you can see in our documentation, latest carbondata is currently
integrated with presto 0.217.
https://github.com/apache/carbondata/blob/master/docs/presto-guide.md

So, presto 0.217 works fine with his query.
Presto 0.219 is not yet supported in this version of carbondata.

Thanks,
Ajantha


On Fri, Jul 12, 2019 at 6:12 AM Yuya Ebihara  wrote:

> Hi Carbondata team,
>
> Could you please help this issue about Presto integration?
> https://github.com/prestodb/presto/issues/12913
>
> (I sent an email to u...@carbondata.apache.org first, but I couldn't send
> it correctly)
>
> BR,
> Yuya Ebihara
>


Re: [Discussion] Migrate CarbonData to support PrestoSQL

2019-05-06 Thread Ajantha Bhat
+1 for carbondata to support prestoSQL,
as the prestoSQL community is very active and prestoDB is more restricted to
Facebook-driven changes.

However, we need to take a call now: does carbondata need to support both
prestoDB and prestoSQL in the future?
I think supporting only prestoSQL is good enough, as users are migrating to
prestoSQL and prestoDB users can still use an older version of carbondata.

Thanks,
Ajantha



On Mon, May 6, 2019 at 6:26 PM Naman Rastogi 
wrote:

> As we all know, *Presto Software Foundation (presto sql) *has been formed
> recently in Jan 2019. and is currently very active in taking and
> implementing many open source features like support for Hive 3.0, Hadoop
> 3.2.0. Support for Hive 3.1 is also in progress.
> Now, old Presto DB is only used by Facebook.
>
> So, it is better for CarbonData to support Presto SQL instead of Presto DB.
>
> I have raised a PR https://github.com/apache/carbondata/pull/3205 to
> migrate CarbonData code to support Presto SQL 310 instead of Presto DB
> 0.217.
>
> Any suggestions/opinions from the community is greatly appreciated.
>
> Regards
> Naman Rastogi
>


Support Apache arrow vector filling from carbondata SDK

2019-05-02 Thread Ajantha Bhat
*Background:*
As we know, Apache Arrow is a cross-language development platform for
in-memory data. It specifies a standardised, language-independent columnar
memory format for flat and hierarchical data, organised for efficient analytic
operations on modern hardware.
By enhancing carbon to support filling an arrow vector, the contents read from
carbondata files can be used for analytics in any programming language: an
arrow vector filled from the carbon java SDK can be read by python, C, C++ and
many other languages supported by arrow.
This will also increase the scope of carbondata use cases, and carbondata can
be used by various applications, as arrow is already integrated with many query
engines.
*Implementation:*
*Stage1:*
After SDK reading the carbondata file, convert carbon rows and fill the
arrow vector.
*Stage2:*
Deep integration with carbon vector; for this, currently carbon SDK vector
doesn't support filling complex columns.
After supporting this, arrow vector can be wrapped around carbon SDK vector
for deep integration.

For stage1, please find the PR below.
https://github.com/apache/carbondata/pull/3193
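
To give a feel for what Stage 1 amounts to, here is a minimal hedged sketch
(not the code in the PR): rows read through the SDK CarbonReader are copied
into Arrow Java vectors. The table path, table name, column names and
projection are hypothetical, and exact API signatures should be checked against
the SDK and Arrow versions in use.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VarCharVector}
import org.apache.carbondata.sdk.file.CarbonReader

val allocator = new RootAllocator(Long.MaxValue)
val idVector = new IntVector("id", allocator)
val nameVector = new VarCharVector("name", allocator)
idVector.allocateNew()
nameVector.allocateNew()

// read carbon rows through the SDK and fill the arrow vectors row by row
val reader: CarbonReader[Array[AnyRef]] = CarbonReader
  .builder("/path/to/carbon/files", "_temp")
  .projection(Array("id", "name"))
  .build()

var row = 0
while (reader.hasNext) {
  val cols = reader.readNextRow()
  idVector.setSafe(row, cols(0).asInstanceOf[Int])
  nameVector.setSafe(row, cols(1).toString.getBytes("UTF-8"))
  row += 1
}
idVector.setValueCount(row)
nameVector.setValueCount(row)
reader.close()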

Thanks,
Ajantha


Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

2018-12-17 Thread Ajantha Bhat
@Liang: yes, your understanding of my proposal is correct.
Why remove empty sort_columns? If the user specifies empty sort_columns, should
I throw an exception saying the specified sort_columns are not present?
I feel there is no need to remove empty sort_columns; by default we set
sort_columns to empty internally.

@xuchuanyin: yes, that's all. But I also want to change
CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT, because in some places
sort_scope is displayed or addressed without referring to sort_columns, and
there I want to show the default as NO_SORT.

@david: I will check about this use case and development scope of this
version. If required, I will do it in a separate PR.

Thanks,
Ajantha

On Mon, Dec 17, 2018 at 7:17 AM David CaiQiang  wrote:

> Better to support alter 'sort_columns' and 'sort_scope' also.
>
> After the table creation and data loading, the user can adjust
> 'sort_columns' and 'sort_scope'.
>
>
>
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

2018-12-11 Thread Ajantha Bhat
Hi all,
Currently in carbondata, we have 'local_sort' as default sort_scope and by
default, all the dimension columns are selected for sort_columns.
This will slow down the data loading.
*To give the user the best performance benefit with default values,*
we can change sort_scope to 'no_sort' and stop using all dimensions for
sort_columns by default.
Also, if sort_columns are specified but sort_scope is not specified by the
user, we implicitly need to consider sort_scope as 'local_sort'.
These default values are applicable for carbonsession, the spark file format
and the SDK (all will have the same behavior).
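
For reference, a minimal sketch of how a user keeps the old behaviour
explicitly once 'no_sort' becomes the default. The table and columns are
hypothetical, 'carbon' is a CarbonSession, and older releases use
STORED BY 'carbondata' instead of STORED AS.

carbon.sql(
  """
    |CREATE TABLE sales_sorted (
    |  order_id BIGINT,
    |  country STRING,
    |  amount DOUBLE
    |) STORED AS carbondata
    |TBLPROPERTIES ('sort_columns'='country', 'sort_scope'='local_sort')
  """.stripMargin)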

With these changes, the performance results of the TPCH queries on 500GB of
data show:

- Load time is improved by nearly 4 times.
- Total query time across all queries is improved (50% of the queries are
  faster with no_sort, the other 50% are slightly degraded or the same;
  overall better performance).

Also, while making this change I found a few major issues in the existing
'no_sort' and empty-sort_columns flow and fixed them as well. Below are the
issues found:

- [CARBONDATA-3162] Range filters don't remove null values for no_sort
  direct dictionary dimension columns.
- [CARBONDATA-3163] If tables have different time formats, for no_sort columns
  the data goes as a bad record (null) for the second table when loaded after
  the first table.
- [CARBONDATA-3164] During no_sort, an exception at the converter step does not
  reach the user; the same problem exists in the SDK and spark file format
  flows.
- Also fixed multiple test case issues.
I have already opened a PR for fixing these issues.
https://github.com/apache/carbondata/pull/2966

Let me know if any suggestions about these changes.

Thanks,
Ajantha


[carbondata-presto enhancements] support reading carbon SDK writer output in presto

2018-12-09 Thread Ajantha Bhat
Currently, carbon SDK files output (files without metadata folder and its
contents) are read by spark using an external table with carbon session.
But presto carbon integration doesn't support that. It can currently read
only the transactional table output files.

Hence we can enhance presto to read SDK output files. This will increase
the use cases for presto-carbon integration.

The above scenario can be achieved by inferring the schema if the metadata
folder does not exist, and by setting the read-committed scope to
LatestFilesReadCommittedScope if non-transactional table output files are
present.


Thanks,
Ajantha


Re: [DISCUSSION] Support DataLoad using Json for CarbonSession

2018-12-05 Thread Ajantha Bhat
Hi,
+1 for the JSON proposal in loading.
This can help with loading complex data types at nested levels.
Currently, CSV loading supports only 2 levels of delimiters; JSON loading can
solve this problem.

While supporting JSON for SDK, I have already handled your point 1)  and 3)
you can refer and use the same.
"org.apache.carbondata.processing.loading.jsoninput.{*JsonInputFormat,
JsonStreamReader*}"
"org.apache.carbondata.processing.loading.parser.impl.*JsonRowParser*"

Yes, regarding point 2) you have to implement the iterator. While doing this,
try to support reading JSON and CSV files together in a folder: CSV files can
be given to the CSV iterator and JSON files to the JSON iterator, so they can
be loaded together.

Also, for the insert-into-by-select flow, you can always send it to the JSON
flow by making loadModel.isJsonFileLoad() always true in
AbstractDataLoadProcessorStep, so that insert into / CTAS with nested complex
type data can be supported.

Also, I suggest you create a JIRA for this and add a design document there.
In the document, also mention which load options are newly supported for this
(like record_identifier to identify JSON records that span multiple lines).
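
As a small hedged sketch of the 'pretty print' path described in the proposal:
the Hadoop configuration key is the one quoted in this thread, while the SQL
load option ('record_identifier') and table name are hypothetical until the
design is finalised; 'carbon' is a CarbonSession.

import org.apache.hadoop.conf.Configuration

// each JSON record spans multiple lines and is rooted at the "event" identifier
val conf = new Configuration()
conf.set("json.input.format.record.identifier", "event")

// a hypothetical load statement once the option is exposed through SQL
carbon.sql(
  """
    |LOAD DATA INPATH 'hdfs://host/data/events.json' INTO TABLE events
    |OPTIONS ('record_identifier'='event')
  """.stripMargin)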

Thanks,
AB








On Wed, Dec 5, 2018 at 3:54 PM Indhumathi  wrote:

> Hello All,
>
> I am working on supporting data load using JSON file for CarbonSession.
>
> 1. Json File Loading will use JsonInputFormat.The JsonInputFormat will read
> two types of JSON formatted data.
> i).The default expectation is each JSON record is newline delimited. This
> method is generally faster and is backed by the LineRecordReader you are
> likely familiar with.
> This will use SimpleJsonRecordReader to read a line of JSON and return it
> as
> a Text object.
> ii).The other method is 'pretty print' of JSON records, where records span
> multiple lines and often have some type of root identifier.
> This method is likely slower, but respects record boundaries much like the
> LineRecordReader. User has to provide the identifier and set
> "json.input.format.record.identifier".
> This will use JsonRecordReader to read JSON records from a file. It
> respects
> split boundaries to complete full JSON records, as specified by the root
> identifier.
> JsonStreamReader handles byte-by-byte reading of a JSON stream, creating
> records based on a base 'identifier'.
>
> 2. Implement JsonRecordReaderIterator similar to CSVRecordReaderIterator
>
> 3. Use JsonRowParser which will convert jsonToCarbonRecord and generate a
> Carbon Row.
>
> Please feel free to provide your comments and suggestions.
>
> Regards,
> Indhumathi M
>
>
>
>
>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

2018-11-22 Thread Ajantha Bhat
@xuchuanyin
Yes, I will be handling this for all types of datamap pruning in the same
flow when I am done with default datamap's implementation and testing.

Thanks,
Ajantha



On Fri, Nov 23, 2018 at 6:36 AM xuchuanyin  wrote:

> 'Parallelize pruning' is in my plan long time ago, nice to see your
> proposal
> here.
>
> While implementing this, I'd like you to make it common, that is to say not
> only default datamap but also other index datamaps can also use parallelize
> pruning.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[proposal] Parallelize block pruning of default datamap in driver for filter query processing.

2018-11-20 Thread Ajantha Bhat
Hi all,
I want to propose *"Parallelize block pruning of default datamap in driver
for filter query processing"*

*Background:*
We do block pruning for the filter queries at the driver side.
In real time big data scenario, we can have millions of carbon files for
one carbon table.
It is currently observed that for 1 million carbon files, block pruning takes
around 5 seconds, as each carbon file takes around 0.005 ms to prune (with only
one filter column set in the 'column_meta_cache' tblproperty).
If there are more files, block pruning takes even more time.
Also, the spark job is not launched until block pruning is completed, so the
user will not know what is happening at that time or why the spark job is not
launching.
Currently block pruning takes time because each segment is processed
sequentially; we can reduce the time by parallelizing it.


*Solution:* keep the default number of threads for block pruning as 4.
The user can reduce this number with the carbon property
"carbon.max.driver.threads.for.pruning", set to a value between 1 and 4.

In TableDataMap.prune(),

group the segments as per the threads by distributing equal carbon files to
each thread.
Launch the threads for a group of segments to handle block pruning.
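
A minimal sketch of the idea (not the actual TableDataMap code): segments are
split into at most N groups with roughly equal carbon-file counts, and each
group is pruned in its own thread. Names and types here are illustrative, and
the pruning function itself is passed in as a parameter.

import java.util.concurrent.{Callable, Executors}

case class SegmentInfo(name: String, carbonFileCount: Int)

def parallelPrune(segments: Seq[SegmentInfo],
                  pruneGroup: Seq[SegmentInfo] => Seq[String],
                  numThreads: Int = 4): Seq[String] = {
  val threads = math.max(1, math.min(numThreads, segments.size))
  // greedy distribution: hand the next-largest segment to the lightest group
  val groups = Array.fill(threads)(scala.collection.mutable.Buffer[SegmentInfo]())
  segments.sortBy(-_.carbonFileCount).foreach { s =>
    groups.minBy(_.map(_.carbonFileCount).sum) += s
  }
  val pool = Executors.newFixedThreadPool(threads)
  try {
    // prune each non-empty group in its own thread and merge the pruned blocks
    val futures = groups.filter(_.nonEmpty).map { group =>
      pool.submit(new Callable[Seq[String]] {
        override def call(): Seq[String] = pruneGroup(group)
      })
    }
    futures.flatMap(_.get()).toSeq
  } finally {
    pool.shutdown()
  }
}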

Thanks,
Ajantha


[Discussion] Encryption support for carbondata files

2018-10-30 Thread Ajantha Bhat
*Background:* Currently carbondata files are not encrypted. If anyone has
carbon reader, they can read the carbondata files.
If the data has sensitive information, that data can be encrypted with the
crypto key.
So that, along with the carbon reader, this key is required to decrypt and
read the data.

*Why encryption at the file format level?*
Files generated by one application can be read by other applications.
Also, encrypting the data at the application level is a time-consuming process,
as the data is very large and whole carbondata files would need to be encrypted
by the application, which is redundant.

If we support encryption at the file format level, only the columns that hold
sensitive data need to be encrypted, giving us column-level encryption.

*Note:* Also keep in mind that encryption needs more CPU for crypto key
computation and decryption also takes some time.
So, it will impact loading and query time if user wants to encrypt the data.

*So, how many of you think this feature has real world use case and carbon
should have this feature ?*

Based on the need of this feature, I can go ahead and explore the
implementation details.

Thanks,
Ajantha


Re: Propose configurable page size in MB (via carbon property)

2018-10-22 Thread Ajantha Bhat
Hi xuchuanyin,

Thanks for your inputs. Please find some details below.

1. There was already a size-based validation in the code for each row
processed, in the 'isVarCharColumnFul()' method. It was checking only varchar
columns; now I am checking complex as well as string columns.

2. The logic for dividing the complex byte array into a flat byte array is
taken from TablePage.addComplexColumn(). This computation will be moved to my
new method and avoided there, so there is no extra computation.

3. Yes, I will make it a create-table property instead of a carbon property.
I will also measure load performance once the changes are made.
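
For illustration, a hedged sketch of what the create-table form could look
like. The property name 'table_page_size_inmb', the table and the value are
assumptions for this sketch; the final name is defined in the design document.
'carbon' is a CarbonSession.

carbon.sql(
  """
    |CREATE TABLE big_string_table (id INT, payload STRING)
    |STORED AS carbondata
    |TBLPROPERTIES ('table_page_size_inmb'='1')
  """.stripMargin)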

Thanks,
Ajantha


On Fri, Oct 19, 2018 at 1:56 PM xuchuanyin  wrote:

> Hi, ajantha.
>
> I just went through your PR and think we may need to rethink this feature,
> especially its impact. I left a comment under your PR and will paste it here
> for further communication in the community.
>
> I'm afraid that in common scenarios, even when we do not face the page size
> problem and play in the safe area, carbondata will still call this method
> to check the boundaries, which will decrease data loading performance.
> So is there a way to avoid unnecessary checking here?
>
> In my opinion, to determine the upper bound of the number of rows in a
> page,
> the default strategy is 'number based' (32000 as the upper bound). Now you
> are adding another additional strategy 'capacity based' (xxMB as the upper
> bound).
>
> There can be multiple strategies for per load, the default is [number
> based], but the user can also configure [number based, capacity based]. So
> before loading, we can get the strategies and apply them while processing.
> At the same time, if the strategies is [number based], we do not need to
> check the capacity, thus avoiding the problem I mentioned above.
>
> Note that we store the rowId in each page using short, it means that the
> number based strategy is a default yet required strategy.
>
> Also, by default, the capacity based strategy is not configured. As for
> this
> strategy, user can add it in:
> 1. TBLProperties in creating table
> 2. Options in loading data
> 3. Options in SdkWriter
> 4. Options in creating table using spark file format
> 5. Options in DataFrameWriter
>
> By all means, we should not configure it in system property, because only
> few of tables use this feature. However adding it in system property will
> decrease their loading performance.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Propose configurable page size in MB (via carbon property)

2018-10-11 Thread Ajantha Bhat
Hi all,
For better in-memory processing of carbondata pages, I am proposing
configurable page size in MB (via carbon property).

The detailed background, problem and solution are described in the design
document. The document is attached to the JIRA below:
https://issues.apache.org/jira/browse/CARBONDATA-3001

Please go through the document in the JIRA and let me know if I can go ahead
with the implementation.

Thanks,
Ajantha


Re: [Serious Issue] Rows disappeared

2018-09-28 Thread Ajantha Bhat
@Aaron:

Please find the issue fix changes in the below PR.

https://github.com/apache/carbondata/pull/2784

I added a test case also and it is passed after my fix.

Thanks,
Ajantha

On Fri, Sep 28, 2018 at 4:57 AM aaron <949835...@qq.com> wrote:

> @Ajantha, Great! looking forward to your fix:)
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [Serious Issue] Rows disappeared

2018-09-27 Thread Ajantha Bhat
@Aaron: I was able to reproduce the issue with my own dataset. (total 350
KB data)

Issue is nothing to do with local dictionary.
I have narrowed down the scenario,

it is with sort columns + compaction.

I will fix soon and update you

Thanks,
Ajantha

On Thu, Sep 27, 2018 at 8:05 PM Kumar Vishal 
wrote:

> Hi Aaron,
> Can you please run compaction again with
> *carbon.local.dictionary.decoder.fallback=false
> *and share the result for the same.
>
> -Regards
> Kumar Vishal
>
> On Thu, Sep 27, 2018 at 7:37 PM aaron <949835...@qq.com> wrote:
>
> > This is the method I construct carbon instance, hope this can help you.
> >
> > def carbonSession(appName: String, masterUrl: String, parallelism:
> String,
> > logLevel: String, hdfsUrl:
> > String="hdfs://ec2-dca-aa-p-sdn-16.appannie.org:9000"): SparkSession = {
> > val storeLocation = s"${hdfsUrl}/usr/carbon/data"
> >
> > CarbonProperties.getInstance()
> >   .addProperty(CarbonCommonConstants.STORE_LOCATION, storeLocation)
> >   .addProperty(CarbonCommonConstants.ENABLE_UNSAFE_SORT, "true")
> >   .addProperty(CarbonCommonConstants.ENABLE_OFFHEAP_SORT, "true")
> >   .addProperty(CarbonCommonConstants.CARBON_TASK_DISTRIBUTION,
> > CarbonCommonConstants.CARBON_TASK_DISTRIBUTION_BLOCKLET)
> >
>  .addProperty(CarbonCommonConstants.CARBON_CUSTOM_BLOCK_DISTRIBUTION,
> > "false")
> >   .addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
> >   //.addProperty(CarbonCommonConstants.ENABLE_AUTO_HANDOFF, "true")
> >   .addProperty(CarbonCommonConstants.ENABLE_AUTO_LOAD_MERGE, "true")
> >
> > .addProperty(CarbonCommonConstants.COMPACTION_SEGMENT_LEVEL_THRESHOLD,
> > "4,3")
> >   .addProperty(CarbonCommonConstants.DAYS_ALLOWED_TO_COMPACT, "0")
> >   .addProperty(CarbonCommonConstants.CARBON_BADRECORDS_LOC,
> > s"${hdfsUrl}/usr/carbon/badrecords")
> >   .addProperty(CarbonCommonConstants.CARBON_QUERY_MIN_MAX_ENABLED,
> > "true")
> >   .addProperty(CarbonCommonConstants.ENABLE_QUERY_STATISTICS,
> "false")
> >   .addProperty(CarbonCommonConstants.ENABLE_DATA_LOADING_STATISTICS,
> > "false")
> >   .addProperty(CarbonCommonConstants.MAX_QUERY_EXECUTION_TIME, "2")
> //
> > 2 minutes
> >   .addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK")
> >   .addProperty(CarbonCommonConstants.LOCK_PATH,
> > s"${hdfsUrl}/usr/carbon/lock")
> >   .addProperty(CarbonCommonConstants.CARBON_MERGE_SORT_READER_THREAD,
> > s"${parallelism}")
> >
> >
> >
> .addProperty(CarbonCommonConstants.CARBON_INVISIBLE_SEGMENTS_PRESERVE_COUNT,
> > "100")
> >   .addProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS,
> > s"${parallelism}")
> >   .addProperty(CarbonCommonConstants.LOAD_SORT_SCOPE, "LOCAL_SORT")
> >   .addProperty(CarbonCommonConstants.NUM_CORES_COMPACTING,
> > s"${parallelism}")
> >   .addProperty(CarbonCommonConstants.UNSAFE_WORKING_MEMORY_IN_MB,
> > "4096")
> >   .addProperty(CarbonCommonConstants.NUM_CORES_LOADING,
> > s"${parallelism}")
> >   .addProperty(CarbonCommonConstants.CARBON_MAJOR_COMPACTION_SIZE,
> > "1024")
> >   .addProperty(CarbonCommonConstants.BLOCKLET_SIZE, "64")
> >   //.addProperty(CarbonCommonConstants.TABLE_BLOCKLET_SIZE, "64")
> >
> > import org.apache.spark.sql.CarbonSession._
> >
> > val carbon = SparkSession
> >   .builder()
> >   .master(masterUrl)
> >   .appName(appName)
> >   .config("spark.hadoop.fs.s3a.impl",
> > "org.apache.hadoop.fs.s3a.S3AFileSystem")
> >   .config("spark.hadoop.dfs.replication", 1)
> >   .config("spark.cores.max", s"${parallelism}")
> >   .getOrCreateCarbonSession(storeLocation)
> >
> > carbon.sparkContext.hadoopConfiguration.setInt("dfs.replication", 1)
> >
> > carbon.sql(s"SET spark.default.parallelism=${parallelism}")
> > carbon.sql(s"SET spark.sql.shuffle.partitions=${parallelism}")
> > carbon.sql(s"SET spark.sql.cbo.enabled=true")
> > carbon.sql(s"SET carbon.options.bad.records.logger.enable=true")
> >
> > carbon.sparkContext.setLogLevel(logLevel)
> > carbon
> >   }
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Serious Issue] Rows disappeared

2018-09-27 Thread Ajantha Bhat
Hi Aaron,
Thanks for reporting issue.
Can you help me narrow down the issue? as I cannot reproduce locally with
the information given in your mail.

a) First can you disable local dictionary and try the same scenario?
b) Can drop datamp and try the same scenario? -- If data is coming from
data map (can see this in explain command)
c) Avoid compaction and try the same scenario.
d) If you can share, give me test data and complete steps. (Because
compaction and other steps are not there in your previous mail)
Mean while, I will try to reproduce locally again but I don't have complete
steps you executed.

Thanks,
Ajantha

On Wed, Sep 26, 2018 at 9:17 PM aaron <949835...@qq.com> wrote:

> Hi Community,
>
> It seems that rows disappeared, same query get different result
>
> carbon.time(carbon.sql(
>   s"""
>  |EXPLAIN SELECT date, market_code, device_code, country_code,
> category_id, product_id, est_free_app_download, est_paid_app_download,
> est_revenue
>  |FROM store
>  |WHERE date = '2016-09-01' AND device_code='ios-phone' AND
> country_code='EE' AND product_id IN (590416158, 590437560)"""
> .stripMargin).show(truncate=false)
> )
>
>
> Screen_Shot_2018-09-26_at_11.png
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t357/Screen_Shot_2018-09-26_at_11.png>
>
> Screen_Shot_2018-09-26_at_11.png
> <
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t357/Screen_Shot_2018-09-26_at_11.png>
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: CarbonWriterBuild issue

2018-09-20 Thread Ajantha Bhat
Also, now that we support passing a Hadoop conf, we don't require the APIs
below and can remove them from CarbonWriterBuilder:

- setAccessKey (both variants)
- setSecretKey (both variants)
- setEndPoint (both variants)
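
For context, a hedged sketch of the direction this implies: S3 credentials go
through the Hadoop configuration handed to the builder rather than dedicated
setters. The bucket path and schema are hypothetical, and the method names
reflect the simplified builder being discussed here, so they may not match the
final signatures exactly.

import org.apache.hadoop.conf.Configuration
import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}

// credentials travel inside the Hadoop conf instead of builder setters
val conf = new Configuration()
conf.set("fs.s3a.access.key", "<access-key>")
conf.set("fs.s3a.secret.key", "<secret-key>")
conf.set("fs.s3a.endpoint", "<endpoint>")

val schema = new Schema(Array(new Field("id", "int"), new Field("name", "string")))
val writer = CarbonWriter.builder()
  .outputPath("s3a://bucket/carbon/output")
  .withCsvInput(schema)
  .withHadoopConf(conf)
  .build()

writer.write(Array("1", "ab"))
writer.close()
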
Thanks,
AB


On Thu, Sep 20, 2018 at 11:16 PM Ajantha Bhat  wrote:

>
> @xuchuanyin:
>
> yes, method signatures will be like you specified.
>
> @kanaka: I still think we should keep only table properties Map as we
> validate "wrong_spells and names". More options will create more confusion.
> So, just keeping table properties Map can simplify configurations. End
> user can form a map and pass. Just like existing withLoadOptions map
>
> Any other suggestions are welcome
>
> Thanks,
> AB
> .
>
> On Thu, Sep 20, 2018 at 10:55 PM kanaka 
> wrote:
>
>> +1 for the proposal to clear SDK APIs.
>> Thanks Ajantha for initiating the code changes.
>>
>> For schema input for  writer creation, I also feel we should unify to all
>> writer creation methods to Builder. API looks cleaner if we provide just
>> build() without out any more arguments.
>>
>>
>> "withTableProperties(Map options) vs
>> sortBy(..),withBlockSize(...),etc"
>> - I think both of these methods can serve for different purposes.
>> withTableProperties(Map options) can be used by customer
>> apps which takes property input directly by end users who is familiar with
>> carbon create table syntax.
>> Individual methods can be used by customers app code to avoid problems
>> like
>> wrong spells or wrong names.
>>
>> "public CarbonWriterBuilder isTransactionalTable(boolean
>> isTransactionalTable)"
>> -- I think we can remove if we are not clear on the usecase at this moment
>> and to avoid confusions
>>
>>
>>
>>
>>
>>
>> --
>> Sent from:
>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>>
>