Re: Schema Aggregate Function

Shiva Jahangiri Thu, 12 Dec 2024 09:11:51 -0800

Hi Professor,

Not sure if the question was for Calvin and me or for the dev team. Calvin
only evaluated his implementation against other implementations compatible
with Spark. I asked him to run the same evaluation against the current
schema inference of Couchbase and update us with the results.


Best,
Shiva

Shiva Jahangiri
Assistant Professor in Computer Science and Engineering Department
Santa Clara University



On Wed, Dec 11, 2024 at 3:56 PM Mike Carey <[email protected]> wrote:

> Question - I think you were doing some perf testing - do you have perf
> results for these (vs. the current schema function)?
>
> On 12/5/24 12:04 PM, Calvin Dani wrote:
> > Hi,
> >
> > Wanted to share an update regarding the features in the APE. The two
> > queries:
> >
> > 1. query_schema()
> >
> > 2. collection_schema()
> >
> > are now functional. The query_schema() implementation has been submitted
> > for review. Once that is approved, I will proceed to submit the
> > collection_schema() query, as it depends on the first query's code.
> >
> > I would greatly appreciate your feedback, additional test cases, and any
> > thoughts you have on this APE. I’m eager to refine it further or, if it
> > seems like a solid starting point, to receive approval for this APE.
> >
> > Thank you for your time and input!
> >
> > Regards
> >
> > Calvin Dani
> >
> > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> The APE has been updated with those changes!
> >>
> >> Regards
> >> Calvin Dani
> >>
> >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<[email protected]> wrote:
> >>
> >>> Excellent!  +1
> >>>
> >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thank you for the feedback. As per the last meeting, here are the
> >>>> changes incorporated into this APE:
> >>>> 1. Names of the schema inference functions
> >>>> 2. Schema inference functionality
> >>>>
> >>>> A summary of the changes follows:
> >>>>
> >>>>     1. query_schema (aggregate function that takes all records of the
> >>>>     subquery and generates a JSON Schema)
> >>>>     2. collection_schema (JSON Schema translation of the datatypes
> >>>>     defined in the metadata node)
> >>>>     3. current_schema (for columnar stores; converts the schema
> >>>>     inferred for storage compaction to JSON Schema)
> >>>>
> >>>>
> >>>> Regards
> >>>> Calvin Dani
> >>>>
> >>>>
> >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<[email protected]> wrote:
> >>>>
> >>>>> Great feature!  I wasn't able to understand the query example(s),
> >>>>> though...  Could those be cleaned up a little and clarified?
> >>>>>
> >>>>> Also, I think we might want two functions at the user level - one that
> >>>>> takes an expression as input and reports its schema, and another that
> >>>>> takes a dataset/collection name as input and reports its schema.  The
> >>>>> first one would scan the results and say what the schema is; the other
> >>>>> would use a more efficient approach (accessing and combining the
> >>>>> metadata from the collection's most recent LSM components in each of
> >>>>> its partitions).
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> >>>>>> Initiating the discussion thread proposing a new aggregate function
> >>>>>> in AsterixDB.
> >>>>>> *Feature:* aggregate function to infer schema
> >>>>>> *Details:* This feature introduces schema inference as an SQL++
> >>>>>> function directly integrated into AsterixDB. It is the first approach
> >>>>>> to offer schema inference as a native SQL++ function, allowing users
> >>>>>> to infer schemas not only for any dataset but also for queries and
> >>>>>> subqueries. Its output is in JSON Schema, the industry standard, so
> >>>>>> the results are both human-readable and machine-readable, suitable
> >>>>>> for user interpretation or integration into other queries or programs.
> >>>>>>
> >>>>>> Using array_avg() in the Built-in Function and Function collection
> >>>>>> file as a template, I implemented array_schema(). During self-review,
> >>>>>> I noticed that many of the defined aggregate functions, for example
> >>>>>> SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction,
> >>>>>> are not used during an array_schema() query. Is this due to different
> >>>>>> use cases, or am I using it incorrectly?
> >>>>>>
> >>>>>> Are there any resources for understanding how aggregate functions
> >>>>>> work in the implementation?
> >>>>>>
> >>>>>> *APE*
> >>>>>>
> >>>>>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*8*3A*Schema*Inference*Aggregate*Functions__;KyUrKysr!!MLMg-p0Z!FqyvPSHBDBmzmh0OrvaQYC1J49E7KD-JoCIprJ0As6vC7oII6114rFWyzF5fy4m-ntYx0OKWQfu1Is74$
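
For readers of this thread, the query_schema() idea discussed above (an aggregate that folds all records of a subquery into one JSON Schema) can be illustrated with a small Python sketch. This is not the AsterixDB implementation: the helper names infer_schema and merge_schemas, the use of "anyOf" for mixed types, and the rule that a field is "required" only if it appears in every record are all assumptions made for illustration.

```python
from typing import Any, Iterable, Optional


def infer_schema(value: Any) -> dict:
    """Return a JSON Schema fragment describing one value (hypothetical helper)."""
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        items = None
        for element in value:
            items = merge_schemas(items, infer_schema(element))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_schema(v) for k, v in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"unsupported value: {value!r}")


def _alternatives(schema: dict) -> list:
    """Flatten a schema into its list of single-type alternatives."""
    return schema["anyOf"] if "anyOf" in schema else [schema]


def _merge_same_type(a: dict, b: dict) -> dict:
    """Merge two schema fragments that share the same "type"."""
    if a["type"] == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge_schemas(props.get(key), sub)
        # Assumption: a field is required only if every record has it.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if a["type"] == "array":
        items = merge_schemas(a.get("items"), b.get("items"))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    return a  # scalar types carry no extra detail in this sketch


def merge_schemas(a: Optional[dict], b: Optional[dict]) -> Optional[dict]:
    """Combine two partial schemas; None means "no records seen yet"."""
    if a is None:
        return b
    if b is None:
        return a
    by_type: dict = {}
    for alt in _alternatives(a) + _alternatives(b):
        t = alt["type"]
        by_type[t] = _merge_same_type(by_type[t], alt) if t in by_type else alt
    alts = [by_type[t] for t in sorted(by_type)]
    return alts[0] if len(alts) == 1 else {"anyOf": alts}


def query_schema(records: Iterable[Any]) -> Optional[dict]:
    """Fold all records into one schema, in the manner of an aggregate."""
    schema = None
    for record in records:
        schema = merge_schemas(schema, infer_schema(record))
    return schema
```

Because merge_schemas combines two partial results, the same function could in principle combine schemas computed separately per partition, which loosely mirrors the "combine per-partition metadata" idea raised earlier in the thread. For example, `query_schema([{"id": 1, "name": "a"}, {"id": 2, "tags": ["x"]}])` yields an object schema in which only `id` is required.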
