Hi,

Regarding the performance testing of the first query for schema inference:

We benchmarked it against contemporary methods, primarily Spark-based
implementations, using a configuration of 2 node controllers and 8 data
partitions.

For a GitHub dataset of 51 GB:

Our approach inferred the schema in 51.6 seconds.

Spark’s native implementation took 81.6 seconds.

Methods by Spoth and Mior required 400+ seconds.
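
To put these timings in relative terms, here is a minimal sketch that computes the speedup factors from the three numbers reported above. The dictionary labels and the helper name `speedup_over` are illustrative only, and the Spoth/Mior entry uses 400 seconds as a lower bound because the result was reported only as "400+", so its ratio is a lower bound too.

```python
# Timings in seconds, as reported above for the 51 GB GitHub dataset.
# Labels are illustrative; 400.0 is a lower bound ("400+" in the report).
timings = {
    "ours": 51.6,
    "spark_native": 81.6,
    "spoth_mior_lower_bound": 400.0,
}


def speedup_over(baseline_seconds: float, ours_seconds: float) -> float:
    """Return how many times faster `ours_seconds` is than the baseline."""
    return baseline_seconds / ours_seconds


ours = timings["ours"]
speedups = {
    name: round(speedup_over(seconds, ours), 2)
    for name, seconds in timings.items()
    if name != "ours"
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

By this arithmetic our approach is roughly 1.6x faster than Spark's native implementation and at least about 7.7x faster than the Spoth and Mior methods on this dataset and configuration.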

I hope this is helpful.
Regards
Calvin Dani


On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:

> Question - I think you were doing some perf testing - do you have perf
> results for these (vs. the current schema function)?
>
> On 12/5/24 12:04 PM, Calvin Dani wrote:
> > Hi,
> >
> > Wanted to share an update regarding the features in the APE. The two
> > queries:
> >
> > 1. query_schema()
> >
> > 2. collection_schema()
> >
> > are now functional. The query_schema() implementation has been submitted
> > for review. Once that is approved, I will proceed to submit the
> > collection_schema() query, as it depends on the first query's code.
> >
> > I would greatly appreciate your feedback, additional test cases, and any
> > thoughts you have on this APE. I’m eager to refine it further or, if it
> > seems like a solid starting point, to receive approval for this APE.
> >
> > Thank you for your time and input!
> >
> > Regards
> >
> > Calvin Dani
> >
> > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<calvinthomas.d...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> The APE has been updated with those changes!
> >>
> >> Regards
> >> Calvin Dani
> >>
> >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<dtab...@gmail.com> wrote:
> >>
> >>> Excellent!  +1
> >>>
> >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<calvinthomas.d...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thank you for the feedback. As per the last meeting, here are the changes
> >>>> incorporated into this APE:
> >>>> 1. Names of the schema inference functions
> >>>> 2. Schema inference functionality
> >>>>
> >>>> The summary of changes is as follows:
> >>>>
> >>>>     1. query_schema (Aggregate function that takes all records of the
> >>>>     subquery and generates a JSON Schema),
> >>>>     2. collection_schema (JSON Schema translation of the defined
> >>>>     datatypes in the metadata node)
> >>>>     3. current_schema (for columnar stores and converting the inferred
> >>>>     schema for storage compaction to JSON Schema)
> >>>>
> >>>>
> >>>> Regards
> >>>> Calvin Dani
> >>>>
> >>>>
> >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<dtab...@gmail.com> wrote:
> >>>>
> >>>>> Great feature!  I wasn't able to understand the query example(s),
> >>>>> though...  Could those be cleaned up a little and clarified?
> >>>>>
> >>>>> Also, I think we might want two functions at the user level - one that
> >>>>> takes an expression as input and reports its schema, and another that
> >>>>> takes a dataset/collection name as input and reports its schema.  The
> >>>>> first one would scan the results and say what the schema is; the other
> >>>>> would use a more efficient approach (accessing and combining the
> >>>>> metadata from the collection's most recent LSM components in each of
> >>>>> its partitions).
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> >>>>>> Initiating the discussion thread proposing a new aggregate function in
> >>>>>> AsterixDB.
> >>>>>> *Feature:* aggregate function to infer schema
> >>>>>> *Details:* This feature introduces schema inference as an SQL++ function
> >>>>>> directly integrated into AsterixDB. It is the first approach to offer
> >>>>>> schema inference as a native SQL++ function, allowing users to infer
> >>>>>> schemas not only for any dataset but also for queries and subqueries.
> >>>>>> Its output is in JSON Schema, the industry standard, so the results are
> >>>>>> both human- and machine-readable, suitable for user interpretation or
> >>>>>> integration into other queries or programs.
> >>>>>>
> >>>>>> array_schema() was implemented following the template of array_avg()
> >>>>>> in the Built-in Function and Function collection file. During
> >>>>>> self-review, I noticed that many of the defined aggregate functions,
> >>>>>> for example SerializableAvgAggregateFunction and
> >>>>>> IntermediateAvgAggregateFunction, are not used when an array_schema()
> >>>>>> query runs. Is this due to different use cases, or am I using them
> >>>>>> incorrectly?
> >>>>>>
> >>>>>> Are there any resources to understand the functionality of aggregate
> >>>>>> functions in the implementation?
> >>>>>>
> >>>>>> *APE*
> >>>>>>
> >>>>>> https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions
