On 2019/10/07 10:26:07, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

Hi Akash,

1. It is better to make it simple and let the user provide the UDF he wants in the query. So there is no need to rewrite the query and no need to provide an extra granularity property.

3. I got your point about why you want to use an accumulator to get min/max. What I am worried about is that it should not add complexity to generate min/max, as we already have this information available. I don't think we should be so bothered about reading min/max in the data loading phase, as it is already a heavy-duty job and adding a few more millis does not do any harm. But as you mentioned it is easier to do, so we can go ahead your way.

Regards,
Ravindra.

On 7 Oct 2019, at 5:38 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi Ravi,

1. i) During create datamap, the user does not mention the UDF in the CTAS query, so if granularity is present in the DM properties, then internally we rewrite the CTAS query with the UDF and then load the data to the datamap, as per the current design.
   ii) But if we ask the user to give the CTAS query with the UDF only, then internally there is no need to rewrite the query; we can just load data to it and avoid giving the granularity in DMPROPERTIES.
   Currently I am planning to do the first one. Please give your input on this.

2. OK, we will not use the RP management in DMPROPERTIES; we will use a separate command and do proper decoupling.

3. I think you are referring to the cache pre-priming in the index server. The problem with this is that we will not be sure whether the cache was loaded for the segment or not, because as per the pre-priming design, if loading to cache fails after the data load to the main table, we ignore it, as the query takes care of it. So we cannot completely rely on this feature for min/max.
   For the accumulator, I am not calculating again; I just take the min/max before writing the index file during data load and use that in the driver to prepare the data load ranges for the datamaps.
   The reason to keep the segment min/max in the table status of the datamap is that it will be helpful in RP scenarios, and second, we will not miss any data when loading to the datamap from the main table [if the first time data came from 1 to 4:15, and then next we get data from 5:10 to 6, there is a chance that we miss 15 minutes of data, from 4 to 4:15]. It will be helpful in querying also, so that we can avoid the problem I mentioned above with datamaps loaded in cache.

4. I agree, your point is a valid one. I will do more analysis on this based on the user use cases and then we can decide finally. That would be better.

Please give your inputs/suggestions on the above points.

Regards,
Akash R Nilugal
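[For reference, a minimal sketch of the accumulator idea discussed in point 3 above: each task adds its partition's (min, max) of the timeseries column to a Spark CollectionAccumulator just before the point where a real load would write its index file, and the driver combines the pairs to derive the load range. This is not the CarbonData implementation; the data, column semantics and names are placeholders.]

import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.CollectionAccumulator

object MinMaxAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("minmax-sketch").master("local[*]").getOrCreate()

    // One (min, max) pair is added per partition, roughly where a real load
    // would otherwise have to read back its index file to learn the range.
    val minMax: CollectionAccumulator[(Long, Long)] =
      spark.sparkContext.collectionAccumulator[(Long, Long)]("timeseriesMinMax")

    // Illustrative data: epoch-millisecond timestamps spread over 4 partitions.
    val events = spark.sparkContext.parallelize((1L to 1000L).map(_ * 60000L), numSlices = 4)

    events.foreachPartition { iter =>
      if (iter.hasNext) {
        val values = iter.toSeq
        minMax.add((values.min, values.max)) // per-partition min/max of the timeseries column
      }
    }

    // Back on the driver: overall min/max of this load, usable to prepare the
    // datamap load ranges without re-reading index files.
    val pairs = minMax.value.asScala
    println(s"load range: ${pairs.map(_._1).min} .. ${pairs.map(_._2).max}")

    spark.stop()
  }
}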
On 2019/10/07 03:03:35, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

Hi Akash,

1. I feel the user providing granularity is redundant; just providing the respective UDF in the select query should be enough.

2. I think it is better to add the RP management now itself; otherwise, if you start adding it to DM properties as a temporary measure, it will never be moved. Better to put a little more effort into decoupling it from datamaps.

3. I feel the accumulator is an added cost. We already have a feature in development to load the datamap immediately after the load happens, why not use that? If the datamap is already in memory, why do we need min/max at segment level?

4. I feel there must be some reason why other timeseries DBs do not support union of data. Consider a scenario where we have data from 1 pm to 4:30 pm; it means the 4 to 5 pm data is still loading. When the user asks for data at the hour level, I feel it is safe to give data for hours 1, 2 and 3, because providing the 4 pm data is actually not complete data. So at least the user comes to know that the 4 pm data is not available and starts querying the lower-level data if he needs it.
   I think it is better to get some real use cases for how users want this time series data.

Regards,
Ravindra.

On 4 Oct 2019, at 9:39 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi Ravi,

1. I forgot to mention the CTAS query in the create datamap statement; I have updated the document. During create datamap the user can give the granularity, and during query just the UDF. That should be fine, right?

2. I think maybe we can mention the RP policy in DM properties also, and then maybe we provide add RP, drop RP and alter RP for existing and older datamaps. RP will be taken as a separate subtask and will be handled in a later part. That should be fine, I think.

3. Actually, consider a scenario when the datamap is already created and then a load happens to the main table; I use an accumulator to get all the min/max to the driver, so that I can avoid reading the index file in the driver in order to load to the datamap.
   The other scenario is when the main table already has segments and then the datamap is created; then we will read the index files from each segment to decide the min/max of the timestamp column.

4. We are not storing min/max in the main table's table status. We are storing it in the datamap table's table status file, so that it can be used to prepare the plan during the query phase.

5. Other timeseries DBs support only getting the data present at hour or day level, i.e. aggregated data. Since we cannot miss data, the plan is to get the data from higher to lower level. Maybe it does not make much difference when going from minute to second, but it makes a difference from year to month, so we cannot avoid aggregations from the main table.

Regards,
Akash R Nilugal

On 2019/10/04 11:35:46, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

Hi Akash,

I have the following suggestions.

1. I think it is redundant to use granularity inside create datamap; the user can use the respective granularity UDF in his query, like time(1h) or time(1d) etc.

2. Better to create separate RP commands and let the user add the RP on the datamap, or even on the main table. It would be more manageable if you make RP an independent feature instead of including it in the datamap.

3. I am not getting why exactly we need an accumulator instead of using the index min/max. Can you explain with some scenario?

4. Why store min/max at segment level? We can get it from the datamap also, right?

5. Are unions of high-granularity tables with low-granularity tables really needed? Is any other time series DB doing it? Or do we have any known use case?

Regards,
Ravindra.
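[To make the granularity discussion in suggestion 1 concrete, here is an illustrative sketch of the two alternatives: granularity declared in DMPROPERTIES with an internal CTAS rewrite, versus the user writing the granularity UDF directly. The DDL shape, the property names ('timeseries_column', 'granularity') and the UDF spelling are placeholders taken from this discussion, not final syntax.]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("timeseries-ddl-sketch").getOrCreate()

// Alternative (i): granularity kept in DMPROPERTIES; the CTAS is rewritten
// internally with the granularity UDF before the datamap is loaded.
spark.sql(
  """CREATE DATAMAP sales_agg_hour ON TABLE sales
    |USING 'timeseries'
    |DMPROPERTIES ('timeseries_column' = 'event_time', 'granularity' = 'hour')
    |AS SELECT event_time, SUM(amount) FROM sales GROUP BY event_time
  """.stripMargin)

// Alternative (ii): the user writes the granularity UDF directly in the CTAS,
// so no rewrite and no 'granularity' property are needed.
spark.sql(
  """CREATE DATAMAP sales_agg_hour ON TABLE sales
    |USING 'timeseries'
    |AS SELECT timeseries(event_time, 'hour'), SUM(amount)
    |   FROM sales GROUP BY timeseries(event_time, 'hour')
  """.stripMargin)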
On 1 Oct 2019, at 5:49 PM, Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi Babu,

Thanks for the inputs. Please find the comments:

1. I will change from Union to UnionAll.

2. For auto datamap loading, once the data is loaded to the lower-granularity datamap, we load the higher-level datamap from the lower-level datamap. But as per your point, I think you are suggesting to load from the main table itself.

3. Similar to the 2nd point; whether we need a configuration or not, we can decide, I think.

4. a. I think the max of the datamap is required to decide the range for the load, because in failure cases we may need it.
   b. This point will be taken care of.

5. Yes, data load is sync as per the current design; as it is non-lazy, it will happen with the main table load only.

6. Yes, this will be handled.

7. Already added a task in JIRA.

On 2019/10/01 08:50:05, babu lal jangir <babulaljangir...@gmail.com> wrote:

Hi Akash, thanks for the Time Series DataMap proposal.
Please check the below points.

1. During query planning, change Union to Union All, otherwise we will lose rows if the same value appears.

2. Does the system start the load for the next granularity-level table as soon as the data condition matches, or does the next granularity-level table have to wait till the current granularity-level table is finished? Please handle this if possible.

3. Add a configuration to load multiple ranges at a time (across granularity tables).

4. Please check whether the current data load's min/max is enough to find the current load. There should be no need to refer to the DataMap's min/max, because data loading range preparation can go wrong if loading happens from multiple drivers. I think the rules below are enough for loading.
   4.a. Create MV should sync data. On any failure, Rebuild should sync again; till then the MV will be disabled.
   4.b. Each load has independent ranges and should load only those ranges. On any failure the MV may go into the disabled state (only if an intermediate range's load fails; the last load's failure will NOT disable the MV).

5. We can make data loading sync, because queries can anyway be served from the fact table if any segment is in progress in the DataMap.

6. In the data loading pipeline, on failures in an intermediate time series datamap, we can still continue loading the next level's data (ignore if already handled).
   For example:
   DataMaps: hour, day, month level.
   Load data (10 days): 2018-01-01 01:00:00 to 2018-01-10 01:00:00.
   Failure at hour level during the range 2018-01-06 01:00:00 to 2018-01-06 01:00:00.
   At this point the hour level has 5 days of data, so start loading at day level.

7. Add a subtask to support loading of in-between missing time (incremental but old records, if the timeseries device stopped working for some time).
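[A small, self-contained illustration of why point 1 above matters: UNION removes duplicate rows, so an aggregated row that legitimately appears in both the datamap branch and the main-table branch of the rewritten plan would be silently dropped, whereas UNION ALL keeps it. The table and column names are made up for the example; this is not the actual rewritten plan.]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-vs-unionall").master("local[*]").getOrCreate()
import spark.implicits._

// One branch standing in for the hour-level datamap, one for an on-the-fly
// aggregation over the main table; both happen to produce the same row.
Seq(("2018-01-01 01:00:00", 100L)).toDF("hour", "sum_amount").createOrReplaceTempView("from_datamap")
Seq(("2018-01-01 01:00:00", 100L)).toDF("hour", "sum_amount").createOrReplaceTempView("from_main_table")

// UNION deduplicates: the identical row survives only once, so a row is lost.
spark.sql("SELECT * FROM from_datamap UNION SELECT * FROM from_main_table").show()

// UNION ALL keeps both rows; nothing is silently dropped.
spark.sql("SELECT * FROM from_datamap UNION ALL SELECT * FROM from_main_table").show()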
On Tue, Oct 1, 2019 at 10:41 AM Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi Vishal,

In the design document, in the impact analysis section, there is a topic on compatibility/legacy stores. Basically, for old tables, when the datamap is created we load all the timeseries datamaps with the different granularities. I think this should do fine; please let me know if you have further suggestions/comments.

Regards,
Akash R Nilugal

On 2019/09/30 17:09:44, Kumar Vishal <kumarvishal1...@gmail.com> wrote:

Hi Akash,

In this design document you haven't mentioned how to handle data loading for the timeseries datamap for older segments [existing tables]. If the customer's main table data is also stored based on time [increasing time] in different segments, he can use this feature as well.

We can discuss and finalize the solution.

-Regards
Kumar Vishal

On Mon, Sep 30, 2019 at 2:42 PM Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi Ajantha,

Thanks for the queries and suggestions.

1. Yes, this is a good suggestion; I will include this change. Both date and timestamp columns are supported; this will be updated in the document.

2. Yes, you are right.

3. You are right; if the day level is not available, then we will try to get the whole day's data from the hour level. If that is not available, as explained in the design document, we will get the data from the datamap UNION the data from the main table, based on the user query.

Regards,
Akash R Nilugal

On 2019/09/30 06:56:45, Ajantha Bhat <ajanthab...@gmail.com> wrote:

+1,

I have some suggestions and questions.

1. In DMPROPERTIES, instead of 'timestamp_column' I suggest using 'timeseries_column', so that it won't give the impression that only the timestamp datatype is supported, and update the document with all the supported datatypes.

2. Querying on this datamap table is also supported, right? Is changing the plan for the main table to refer to the datamap table only so that the user can avoid changing his query, or is there any other reason?

3. If the user has not created a day-granularity datamap, but only an hour-granularity datamap: when a query has day granularity, will data be fetched from the hour-granularity datamap and aggregated, or is data fetched from the main table?

Thanks,
Ajantha
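[As an illustration of answer 3 above: with only an hour-level datamap, a day-level query could be served by rolling the hour-level rows up to days and appending the still-loading range from the main table. All table and column names and the cutoff timestamp are assumptions for illustration, not the actual rewritten plan.]

// Day-level result assembled from the hour-level datamap plus the in-progress
// range read directly from the main table.
spark.sql(
  """SELECT to_date(hour_ts) AS day_ts, SUM(sum_amount) AS total
    |FROM sales_agg_hour
    |WHERE hour_ts < to_timestamp('2018-01-10 04:00:00')
    |GROUP BY to_date(hour_ts)
    |UNION ALL
    |SELECT to_date(event_time) AS day_ts, SUM(amount) AS total
    |FROM sales
    |WHERE event_time >= to_timestamp('2018-01-10 04:00:00')
    |GROUP BY to_date(event_time)
  """.stripMargin).show()

[Note that the boundary day appears once per branch here, so a real rewrite would need one more aggregation on top to merge the two partial rows for that day.]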
On Mon, Sep 30, 2019 at 11:46 AM Akash Nilugal <akashnilu...@gmail.com> wrote:

Hi xuchuanyin,

Thanks for the comments/suggestions.

1. Preaggregate is productized, but not the timeseries with preaggregate; I think you got confused with that, if I am right.

2. Limitations like auto sampling or rollup, which we will be supporting now, retention policies, etc.

3. segmentTimestampMin: this I will consider in the design.

4. RP is added as a separate task. I thought that instead of maintaining two variables it is better to maintain one and parse it, but I will consider your point based on feasibility during implementation.

5. We use an accumulator which takes a list, so before writing the index files we take the min/max of the timestamp column, fill the accumulator, and then we can access accumulator.value in the driver after the load is finished.

Regards,
Akash R Nilugal

On 2019/09/28 10:46:31, xuchuanyin <xuchuan...@apache.org> wrote:

Hi Akash, glad to see the feature proposed, and I have some questions about this. Please notice that some of the following descriptions are comments, marked with '===', on text quoted from the design document attached in the corresponding JIRA.

1. "Currently carbondata supports timeseries on preaggregate datamap, but its an alpha feature"
===
It has been some time since the preaggregate datamap was introduced and it is still **alpha**; why is it still not product-ready? Will the new feature also end up in a similar situation?

2. "there are so many limitations when we compare and analyze the existing timeseries database or projects which supports time series like apache druid or influxdb"
===
What are the actual limitations? Besides, please give an example.

3. "Segment_Timestamp_Min"
===
Suggest using camel-case style, like 'segmentTimestampMin'.

4. "RP is way of telling the system, for how long the data should be kept"
===
Since the function is simple, I'd suggest using 'retentionTime'=15 and 'timeUnit'='day' instead of 'RP'='15_days'.

5. "When the data load is called for main table, use an spark accumulator to get the maximum value of timestamp in that load and return to the load."
===
How can you get the Spark accumulator? The load is launched using loading-by-dataframe, not using global-sort-by-spark.

6. For the rest of the content, still reading.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
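[On point 4 of the last two messages (a single 'RP'='15_days' property versus separate 'retentionTime'/'timeUnit' properties): keeping one property would simply mean parsing it once, roughly as below. This is only a sketch of the idea under the '15_days' format mentioned in the thread, not a proposed implementation.]

import java.util.concurrent.TimeUnit

// Parse a single retention-policy property such as 'RP'='15_days'
// into an (amount, unit) pair; purely illustrative.
def parseRetentionPolicy(rp: String): (Long, TimeUnit) = {
  val Array(amount, unit) = rp.trim.toLowerCase.split("_", 2)
  val timeUnit = unit match {
    case "minute" | "minutes" => TimeUnit.MINUTES
    case "hour"   | "hours"   => TimeUnit.HOURS
    case "day"    | "days"    => TimeUnit.DAYS
    case other => throw new IllegalArgumentException(s"Unsupported RP unit: $other")
  }
  (amount.toLong, timeUnit)
}

// e.g. parseRetentionPolicy("15_days") returns (15L, TimeUnit.DAYS)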