UDFs might have a performance problem: Spark built-in functions, Spark UDFs and Hive UDFs
behave differently in terms of performance.
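
As a rough, minimal sketch of the concern (the event_time column and the hand-written
truncation UDF below are only illustrative): a Spark built-in function such as
date_trunc is understood by Catalyst and can be optimized, while a registered Scala
UDF, and similarly a Hive UDF, is a black box to the optimizer and pays per-row
(de)serialization cost.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, date_trunc, udf}

    object UdfComparison {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-comparison").getOrCreate()
        import spark.implicits._

        val df = Seq("2018-01-01 01:23:45", "2018-01-01 02:10:00")
          .toDF("event_time")
          .withColumn("event_time", col("event_time").cast("timestamp"))

        // Built-in function: Catalyst knows its semantics and can optimize around it.
        df.select(date_trunc("hour", col("event_time"))).show()

        // Hand-written UDF doing an epoch-based hour truncation: opaque to the optimizer.
        val truncToHour = udf((ts: java.sql.Timestamp) =>
          new java.sql.Timestamp(ts.getTime - ts.getTime % (3600L * 1000L)))
        df.select(truncToHour(col("event_time"))).show()
      }
    }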


On 2019/10/07 10:26:07, Ravindra Pesala <ravi.pes...@gmail.com> wrote: 
> Hi Akash,
> 
> 1. It is better to make it simple and let the user provide the udf he wants in 
> the query. So there is no need to rewrite the query and no need to provide an 
> extra granularity property.
> 
> 3. I got your point about why you want to use an accumulator to get min/max. What 
> worries me is that it should not add complexity to generate min/max, as we already 
> have this information available. I don’t think we should be so bothered about 
> reading min/max in the data loading phase, as it is already a heavy-duty job and 
> adding a few more millis does not do any harm. But as you mentioned it is easier 
> to do, so we can go ahead your way.
> 
> 
> Regards,
> Ravindra.
> 
> > On 7 Oct 2019, at 5:38 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:
> > 
> > Hi Ravi,
> > 
> > 1. i) During create datamap, in the ctas query the user does not mention the udf, so 
> > if granularity is present in the DM properties, then internally we rewrite the 
> > ctas query with the udf and then load the data to the datamap according to the 
> > current design.
> >   ii) but if we ask the user to give the ctas query with the udf, then internally 
> > there is no need to rewrite the query; we can just load data to it and avoid giving 
> > the granularity in DMPROPERTIES.
> >     Currently I'm planning to do the first one. Please give your input on this.
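> > 
> > For illustration only, here is a minimal sketch of the two options, loosely 
> > modelled on the existing preaggregate-timeseries syntax (assuming a SparkSession 
> > named spark; the table, datamap, property and udf names are placeholders, not 
> > the final DDL):
> > 
> >     // Option (i): granularity in DMPROPERTIES; the ctas query is rewritten
> >     // internally with the udf before loading the datamap.
> >     spark.sql(
> >       """CREATE DATAMAP agg_sales_hour ON TABLE sales
> >         |USING 'timeseries'
> >         |DMPROPERTIES ('event_time'='order_time', 'hour_granularity'='1')
> >         |AS SELECT order_time, sum(price) FROM sales GROUP BY order_time
> >       """.stripMargin)
> > 
> >     // Option (ii): the user writes the granularity udf in the ctas query
> >     // himself, so no rewrite and no granularity property are needed.
> >     spark.sql(
> >       """CREATE DATAMAP agg_sales_hour ON TABLE sales
> >         |USING 'timeseries'
> >         |AS SELECT timeseries(order_time, 'hour'), sum(price)
> >         |   FROM sales
> >         |   GROUP BY timeseries(order_time, 'hour')
> >       """.stripMargin)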
> > 
> > 2. Ok, we will not use the RP management in DMPROPERTIES; we will use a 
> > separate command and do proper decoupling.
> > 
> > 3. I think you are referring to the cache pre-priming in the index server. 
> > The problem with this is that we will not be sure whether the cache was loaded for 
> > the segment or not, because as per the pre-priming design, if loading to the cache 
> > fails after the data load to the main table, we ignore it since the query takes 
> > care of it. So we cannot completely rely on this feature for min/max.
> > So with the accumulator I'm not calculating it again; I just take the min/max before 
> > writing the index file in the data load and use that in the driver to prepare the 
> > data load ranges for the datamaps.
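> > 
> > A minimal, simplified sketch of that accumulator idea (the load job is mocked 
> > here with a plain RDD of longs; in the real flow the add() call would sit where 
> > each task writes its index file, and the driver part would run after the load 
> > finishes):
> > 
> >     import org.apache.spark.sql.SparkSession
> >     import scala.collection.JavaConverters._
> > 
> >     val spark = SparkSession.builder().appName("minmax-accumulator").getOrCreate()
> > 
> >     // Each element is the (min, max) of the timestamp column seen by one task.
> >     val minMaxAcc = spark.sparkContext.collectionAccumulator[(Long, Long)]("loadMinMax")
> > 
> >     // Stand-in for the data load: every task reports the min/max of its slice
> >     // just before it would write its index file.
> >     spark.sparkContext.parallelize(1L to 1000000L, numSlices = 4).foreachPartition { it =>
> >       val ts = it.toArray
> >       if (ts.nonEmpty) minMaxAcc.add((ts.min, ts.max))
> >     }
> > 
> >     // Back on the driver, after the load finishes: derive the load's overall
> >     // range and use it to prepare the data load ranges for the datamaps.
> >     val perTask = minMaxAcc.value.asScala
> >     val loadMin = perTask.map(_._1).min
> >     val loadMax = perTask.map(_._2).max
> >     println(s"load range: $loadMin .. $loadMax")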
> > 
> > The reason to keep the segment min/max in the table status of the datamap is 
> > that, first, it will be helpful in RP scenarios, and second, we will not miss 
> > any data when loading to the datamap from the main table [if the first load brought 
> > data from 1 to 4:15, and the next load brings data from 5:10 to 6, then there is a 
> > chance that we miss the 15 minutes of data from 4 to 4:15]. It will be helpful in 
> > querying also, so that we can avoid the problem I mentioned above with 
> > datamaps loaded in cache.
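> > 
> > A toy sketch of the range derivation that bracketed example implies (variable 
> > names and values are hypothetical; times are minutes since midnight): the start 
> > of the next datamap load range should come from the max recorded in the 
> > datamap's table status, not from the new segment's own min.
> > 
> >     // First load covered 1:00-4:15; the hour-level datamap could only roll up
> >     // complete hours, so its table status records coverage up to 4:00.
> >     val recordedDatamapMax = 4 * 60          // 4:00, from the datamap table status
> >     val newSegmentMin      = 5 * 60 + 10     // 5:10, min of the newly loaded segment
> >     val newSegmentMax      = 6 * 60          // 6:00, max of the newly loaded segment
> > 
> >     // Deriving the range only from the new segment skips 4:00 to 5:10,
> >     // including the 4:00-4:15 data already sitting in the main table.
> >     val rangeFromNewSegment = (newSegmentMin, newSegmentMax)
> > 
> >     // Continuing from the recorded max keeps the rolled-up data contiguous.
> >     val rangeFromStatus = (recordedDatamapMax, newSegmentMax)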
> > 
> > 4. I agree, your point is a valid one. I will do more analysis on this based 
> > on the user use cases and then we can decide finally. That would be better.
> > 
> > Please give your inputs/suggestions on the above points.
> > 
> > regards,
> > Akash R Nilugal
> > 
> > On 2019/10/07 03:03:35, Ravindra Pesala <ravi.pes...@gmail.com> wrote: 
> >> Hi Akash,
> >> 
> >> 1. I feel the user providing granularity is redundant; just providing the 
> >> respective udf in the select query should be enough.
> >> 
> >> 2. I think it is better to add the RP management now itself, otherwise if 
> >> you start adding it to DM properties as a temporary measure then it will never be 
> >> moved. Better to put a little more effort into decoupling it from datamaps.
> >> 
> >> 3. I feel the accumulator is an added cost; we already have a feature in 
> >> development to load the datamap immediately after the load happens, so why not use 
> >> that? If the datamap is already in memory, why do we need min/max at segment 
> >> level?
> >> 
> >> 4. I feel there must be some reason why other timeseries DBs do not 
> >> support a union of the data. Consider a scenario where we have data from 1pm to 
> >> 4.30pm, which means the 4 to 5pm data is still loading. When the user asks for 
> >> data at hour level, I feel it is safe to give data only for hours 1, 2 and 3, 
> >> because the 4pm bucket is actually not complete data. That way at least the user 
> >> comes to know that the 4pm data is not available and starts querying the lower 
> >> level data if he needs it.
> >> I think it is better to get some real use cases for how users want this time series data.
> >> 
> >> Regards,
> >> Ravindra.
> >> 
> >>> On 4 Oct 2019, at 9:39 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:
> >>> 
> >>> Hi Ravi,
> >>> 
> >>> 1. I forgot to mention the CTAS query in the create datamap statement; I 
> >>> have updated the document. During create datamap the user can give the 
> >>> granularity, and during query just the UDF. That should be fine, right?
> >>> 2. I think maybe we can mention the RP policy in the DM properties also, and 
> >>> then maybe we provide add RP, drop RP and alter RP for existing and older 
> >>> datamaps. RP will be taken up as a separate subtask and will be handled in a 
> >>> later part. That should be fine, I think.
> >>> 3. Actually, consider a scenario where the datamap is already created and then a 
> >>> load happens to the main table; there I use the accumulator to get all the 
> >>> min/max to the driver, so that I can avoid reading the index files in the driver 
> >>> in order to load to the datamap. 
> >>>        The other scenario is when the main table already has segments and then the 
> >>> datamap is created; then we will read the index files from each segment to 
> >>> decide the min/max of the timestamp column.
> >>> 4. We are not storing min/max in the main table's table status. We are storing it 
> >>> in the datamap table's table status file, so that it can be used to prepare 
> >>> the plan during the query phase.
> >>> 
> >>> 5. Other timeseries DBs support only getting the data already present at hour or 
> >>> day level, i.e. aggregated data. Since we cannot miss any data, the plan is to 
> >>> serve the query from the higher granularity and fall back to the lower one. 
> >>> Maybe it does not make much difference when going from minute to second, but it 
> >>> makes a difference from year to month, so we cannot avoid aggregations from 
> >>> the main table.
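> >>> 
> >>> For illustration, a hand-written sketch of the shape such a rewritten query 
> >>> could take (assuming a SparkSession named spark; table, column and udf names 
> >>> are placeholders, and the real plan is generated internally). A day-level 
> >>> aggregate is served from the hour-level datamap where it has data, and the 
> >>> not-yet-rolled-up tail comes from the main table, combined with UNION ALL so 
> >>> no rows are dropped:
> >>> 
> >>>     spark.sql(
> >>>       """SELECT timeseries(event_time, 'day') AS day, sum(price) AS total
> >>>         |FROM (
> >>>         |  SELECT event_time, agg_price AS price FROM agg_sales_hour
> >>>         |  WHERE event_time <  '2018-01-10 00:00:00'
> >>>         |  UNION ALL
> >>>         |  SELECT event_time, price FROM sales
> >>>         |  WHERE event_time >= '2018-01-10 00:00:00'
> >>>         |) t
> >>>         |GROUP BY timeseries(event_time, 'day')
> >>>       """.stripMargin)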
> >>> 
> >>> 
> >>> Regards,
> >>> Akash R Nilugal
> >>> 
> >>> On 2019/10/04 11:35:46, Ravindra Pesala <ravi.pes...@gmail.com> wrote: 
> >>>> Hi Akash,
> >>>> 
> >>>> I have following suggestions.
> >>>> 
> >>>> 1. I think it is redundant to use granularity inside create datamap; the 
> >>>> user can use the respective granularity UDF in his query, like time(1h) 
> >>>> or time(1d) etc.
> >>>> 
> >>>> 2. Better to create separate RP commands and let the user add the RP on the 
> >>>> datamap or even on the main table also. It would be more manageable if 
> >>>> you make RP an independent feature instead of including it in the datamap.
> >>>> 
> >>>> 3. I am not getting why exactly we need the accumulator instead of using the 
> >>>> index min/max? Can you explain with some scenario? 
> >>>> 
> >>>> 4. Why store min/max at segment level? We can get it from the datamap also, 
> >>>> right?
> >>>> 
> >>>> 5. Is the union of high granularity tables with low granularity tables 
> >>>> really needed? Is any other time series DB doing it? Or do we have any known 
> >>>> use case?
> >>>> 
> >>>> Regards,
> >>>> Ravindra.
> >>>> 
> >>>>> On 1 Oct 2019, at 5:49 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:
> >>>>> 
> >>>>> Hi Babu,
> >>>>> 
> >>>>> Thanks for the inputs. Please find the comments below. 
> >>>>> 1. I will change from Union to UnionAll.
> >>>>> 2. For auto datamap loading, once the data is loaded to the lower level 
> >>>>> granularity datamap, we load the higher level datamap from the 
> >>>>> lower level datamap. But as per your point, I think you are suggesting to 
> >>>>> load from the main table itself.
> >>>>> 3. Similar to the 2nd point, whether we need a configuration or not we can 
> >>>>> decide, I think.
> >>>>> 4. a. I think the max of the datamap is required to decide the range 
> >>>>> for the load, because we may need it in the failure case.
> >>>>> b. This point will be taken care of.
> >>>>> 5. Yes, the data load is sync based on the current design; as it is non-lazy, it 
> >>>>> will happen with the main table load only.
> >>>>> 6. Yes, this will be handled.
> >>>>> 7. Already added a task in Jira.
> >>>>> On 2019/10/01 08:50:05, babu lal jangir <babulaljangir...@gmail.com> 
> >>>>> wrote: 
> >>>>>> Hi Akash, thanks for the Time Series DataMap proposal.
> >>>>>> Please check the points below.
> >>>>>> 
> >>>>>> 1. During query planning, change Union to Union All, otherwise we will
> >>>>>> lose rows if the same value appears.
> >>>>>> 2. Does the system start the load for the next granularity level table as
> >>>>>> soon as it matches the data condition, or does the next granularity level
> >>>>>> table have to wait till the current granularity level table is finished?
> >>>>>> Please handle this if possible.
> >>>>>> 3. Add a configuration to load multiple ranges at a time (across
> >>>>>> granularity tables).
> >>>>>> 4. Please check if the current data load's min/max is enough to find the
> >>>>>> current load. There should be no need to refer to the DataMap's min/max,
> >>>>>> because data loading range preparation can go wrong if loading happens from
> >>>>>> multiple drivers. I think the rules below are enough for loading.
> >>>>>>  4.a. Create MV should sync data. On any failure, Rebuild should sync
> >>>>>> again; till then the MV will be disabled.
> >>>>>>  4.b. Each load has independent ranges and should load only those
> >>>>>> ranges. On any failure the MV may go into disabled state (only if an
> >>>>>> intermediate range's load failed; the last load's failure will NOT make the
> >>>>>> MV disabled).
> >>>>>> 5. We can make data loading sync because queries can anyway be served from
> >>>>>> the fact table if any segment is in-progress in the Datamap.
> >>>>>> 6. In the data loading pipeline, on a failure in an intermediate time
> >>>>>> series datamap we can still continue loading the next level's data (ignore
> >>>>>> if already handled).
> >>>>>> For example:
> >>>>>>  DataMaps:- Hour, Day, Month level
> >>>>>>  Load Data (10 days):- 2018-01-01 01:00:00 to 2018-01-10 01:00:00
> >>>>>>    Failure in hour level during the below range:
> >>>>>>      2018-01-06 01:00:00 to 2018-01-06 01:00:00
> >>>>>>   At this point of time the hour level has 5 days of data, so start loading
> >>>>>> the day level.
> >>>>>> 7. Add a subtask to support loading of in-between missing time
> >>>>>> (incremental but old records, e.g. if a timeseries device stopped working
> >>>>>> for some time).
> >>>>>> 
> >>>>>> On Tue, Oct 1, 2019 at 10:41 AM Akash Nilugal <akashnilu...@gmail.com>
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> Hi vishal,
> >>>>>>> 
> >>>>>>> In the design document, in the impact analysis section, there is a
> >>>>>>> topic on compatibility/legacy stores. Basically, for old tables, when the
> >>>>>>> datamap is created we load all the timeseries datamaps with the different
> >>>>>>> granularities.
> >>>>>>> I think this should do fine; please let me know if you have further
> >>>>>>> suggestions/comments.
> >>>>>>> 
> >>>>>>> Regards,
> >>>>>>> Akash R Nilugal
> >>>>>>> 
> >>>>>>> On 2019/09/30 17:09:44, Kumar Vishal <kumarvishal1...@gmail.com> 
> >>>>>>> wrote:
> >>>>>>>> Hi Akash,
> >>>>>>>> 
> >>>>>>>> In this design document you haven't mentioned how to handle data
> >>>>>>>> loading for the timeseries datamap for older segments [existing table].
> >>>>>>>> If the customer's main table data is also stored based on
> >>>>>>>> time [increasing time] in different segments, he can use this feature as well.
> >>>>>>>> 
> >>>>>>>> We can discuss and finalize the solution.
> >>>>>>>> 
> >>>>>>>> -Regards
> >>>>>>>> Kumar Vishal
> >>>>>>>> 
> >>>>>>>> On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal 
> >>>>>>>> <akashnilu...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> Hi Ajantha,
> >>>>>>>>> 
> >>>>>>>>> Thanks for the queries and suggestions
> >>>>>>>>> 
> >>>>>>>>> 1. Yes, this is a good suggestion, I'll include this change. Both date
> >>>>>>>>> and timestamp columns are supported; this will be updated in the document.
> >>>>>>>>> 2. Yes, you are right.
> >>>>>>>>> 3. You are right: if the day level datamap is not available, then we will
> >>>>>>>>> try to get the whole day's data from the hour level; if that is not
> >>>>>>>>> available, as explained in the design document, we will get the data from
> >>>>>>>>> the datamap UNION the data from the main table, based on the user query.
> >>>>>>>>> 
> >>>>>>>>> Regards,
> >>>>>>>>> Akash R Nilugal
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> On 2019/09/30 06:56:45, Ajantha Bhat <ajanthab...@gmail.com> wrote:
> >>>>>>>>>> + 1 ,
> >>>>>>>>>> 
> >>>>>>>>>> I have some suggestions and questions.
> >>>>>>>>>> 
> >>>>>>>>>> 1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using
> >>>>>>>>>> 'timeseries_column',
> >>>>>>>>>> so that it won't give the impression that only the timestamp datatype is
> >>>>>>>>>> supported. Please also update the document with all the supported datatypes.
> >>>>>>>>>> 
> >>>>>>>>>> 2. Querying this datamap table directly is also supported, right? Is
> >>>>>>>>>> rewriting the main table's plan to refer to the datamap table only meant
> >>>>>>>>>> to save the user from changing his query, or is there any other reason?
> >>>>>>>>>> 
> >>>>>>>>>> 3. Suppose the user has not created a day granularity datamap, but has only
> >>>>>>>>>> created an hour granularity datamap. When a query has day granularity, will
> >>>>>>>>>> the data be fetched from the hour granularity datamap and aggregated, or will
> >>>>>>>>>> it be fetched from the main table?
> >>>>>>>>>> 
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Ajantha
> >>>>>>>>>> 
> >>>>>>>>>> On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <
> >>>>>>> akashnilu...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> Hi xuchuanyin,
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks for the comments/Suggestions
> >>>>>>>>>>> 
> >>>>>>>>>>> 1. Preaggregate is productized, but not the timeseries with preaggregate;
> >>>>>>>>>>> I think you got confused between the two, if I'm right.
> >>>>>>>>>>> 2. Limitations like auto sampling or rollup, which we will now be
> >>>>>>>>>>> supporting, and retention policies, etc.
> >>>>>>>>>>> 3. segmentTimestampMin: I will consider this in the design.
> >>>>>>>>>>> 4. RP is added as a separate task. I thought that, instead of maintaining
> >>>>>>>>>>> two variables, it is better to maintain one and parse it. But I will
> >>>>>>>>>>> consider your point based on feasibility during implementation.
> >>>>>>>>>>> 5. We use an accumulator which takes a list, so before writing the index
> >>>>>>>>>>> files we take the min/max of the timestamp column and fill the accumulator,
> >>>>>>>>>>> and then we can access accumulator.value in the driver after the load is
> >>>>>>>>>>> finished.
> >>>>>>>>>>> 
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Akash R Nilugal
> >>>>>>>>>>> 
> >>>>>>>>>>> On 2019/09/28 10:46:31, xuchuanyin <xuchuan...@apache.org> wrote:
> >>>>>>>>>>>> Hi Akash, glad to see the feature proposed, and I have some questions about
> >>>>>>>>>>>> this. Please notice that in the following, the quoted descriptions come from
> >>>>>>>>>>>> the design document attached in the corresponding jira, and my comments
> >>>>>>>>>>>> follow the '===' markers.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 1.
> >>>>>>>>>>>> "Currently carbondata supports timeseries on preaggregate
> >>>>>>> datamap,
> >>>>>>>>> but
> >>>>>>>>>>> its
> >>>>>>>>>>>> an alpha feature"
> >>>>>>>>>>>> ===
> >>>>>>>>>>>> It has been some time since the preaggregate datamap was introduced and it
> >>>>>>>>>>>> is still **alpha**; why is it still not product-ready? Will the new feature
> >>>>>>>>>>>> also end up in a similar situation?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 2.
> >>>>>>>>>>>> "there are so many limitations when we compare and analyze the
> >>>>>>>>> existing
> >>>>>>>>>>>> timeseries database or projects which supports time series like
> >>>>>>>>> apache
> >>>>>>>>>>> druid
> >>>>>>>>>>>> or influxdb"
> >>>>>>>>>>>> ===
> >>>>>>>>>>>> What are the actual limitations? Besides, please give an example
> >>>>>>> of
> >>>>>>>>> this.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 3.
> >>>>>>>>>>>> "Segment_Timestamp_Min"
> >>>>>>>>>>>> ===
> >>>>>>>>>>>> Suggest using camel-case style like 'segmentTimestampMin'
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 4.
> >>>>>>>>>>>> "RP is way of telling the system, for how long the data should be
> >>>>>>>>> kept"
> >>>>>>>>>>>> ===
> >>>>>>>>>>>> Since the function is simple, I'd suggest using
> >>>>>>> 'retentionTime'=15
> >>>>>>>>> and
> >>>>>>>>>>>> 'timeUnit'='day' instead of 'RP'='15_days'
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 5.
> >>>>>>>>>>>> "When the data load is called for main table, use an spark
> >>>>>>>>> accumulator to
> >>>>>>>>>>>> get the maximum value of timestamp in that load and return to the
> >>>>>>>>> load."
> >>>>>>>>>>>> ===
> >>>>>>>>>>>> How can you get the spark accumulator? The load is launched using
> >>>>>>>>>>>> loading-by-dataframe not using global-sort-by-spark.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 6.
> >>>>>>>>>>>> For the rest of the content, still reading.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Sent from:
> >>>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>> 
> >>>>>> 
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 
