Hi,

Regarding the performance testing of the first query for schema inference:

We benchmarked it against contemporary methods, primarily Spark-based implementations, using a configuration of 2 node controllers and 8 data partitions. For a 51 GB GitHub dataset, our approach inferred the schema in 51.6 seconds, Spark’s native implementation took 81.6 seconds, and the methods by Spoth and Mior required 400+ seconds.

I hope this is helpful.

Regards
Calvin Dani

On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:

> Question - I think you were doing some perf testing - do you have perf
> results for these (vs. the current schema function)?
>
> On 12/5/24 12:04 PM, Calvin Dani wrote:
> > Hi,
> >
> > Wanted to share an update regarding the features in the APE. The two
> > queries:
> >
> > 1. query_schema()
> >
> > 2. collection_schema()
> >
> > are now functional. The query_schema() implementation has been submitted
> > for review. Once that is approved, I will proceed to submit the
> > collection_schema() query, as it depends on the first query's code.
> >
> > I would greatly appreciate your feedback, additional test cases, and any
> > thoughts you have on this APE. I’m eager to refine it further or, if it
> > seems like a solid starting point, to receive approval for this APE.
> >
> > Thank you for your time and input!
> >
> > Regards
> >
> > Calvin Dani
> >
> > On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> The APE has been updated with those changes!
> >>
> >> Regards
> >> Calvin Dani
> >>
> >> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey <dtab...@gmail.com> wrote:
> >>
> >>> Excellent! +1
> >>>
> >>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thank you for the feedback. As per the last meeting, here are the
> >>>> changes incorporated into this APE:
> >>>> 1. Name of the schema inference functions
> >>>> 2. Schema inference functionality
> >>>>
> >>>> The summary of changes is as follows:
> >>>>
> >>>> 1. query_schema (aggregate function that takes all records of the
> >>>> subquery and generates a JSON Schema)
> >>>> 2. collection_schema (JSON Schema translation of the defined datatypes
> >>>> in the metadata node)
> >>>> 3. current_schema (for columnar stores; converts the schema inferred
> >>>> for storage compaction to JSON Schema)
> >>>>
> >>>>
> >>>> Regards
> >>>> Calvin Dani
> >>>>
> >>>>
> >>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey <dtab...@gmail.com> wrote:
> >>>>
> >>>>> Great feature! I wasn't able to understand the query example(s),
> >>>>> though... Could those be cleaned up a little and clarified?
> >>>>>
> >>>>> Also, I think we might want two functions at the user level - one that
> >>>>> takes an expression as input and reports its schema, and another that
> >>>>> takes a dataset/collection name as input and reports its schema. The
> >>>>> first one would scan the results and say what the schema is; the other
> >>>>> would use a more efficient approach (accessing and combining the
> >>>>> metadata from the collection's most recent LSM components in each of
> >>>>> its partitions).
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
> >>>>>> Initiating the discussion thread proposing a new aggregate function
> >>>>>> in AsterixDB.
> >>>>>> *Feature:* aggregate function to infer schema
> >>>>>> *Details:* This feature introduces schema inference as an SQL++
> >>>>>> function directly integrated into AsterixDB. It is the first approach
> >>>>>> to offer schema inference as a native SQL++ function, allowing users
> >>>>>> to infer schemas not only for any dataset but also for queries and
> >>>>>> subqueries. Its output is in JSON Schema, the industry standard, and
> >>>>>> is both human- and machine-readable, suitable for user interpretation
> >>>>>> or integration into other queries or programs.
> >>>>>>
> >>>>>> Using array_avg() in the Built-in Function and Function collection
> >>>>>> files as a template, I implemented array_schema(). During self-review,
> >>>>>> I noticed that many of the defined aggregate functions, for example
> >>>>>> SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction,
> >>>>>> are not being used during an array_schema() query. Is this due to
> >>>>>> different use cases, or am I using them incorrectly?
> >>>>>>
> >>>>>> Are there any resources to understand the functionality of aggregate
> >>>>>> functions in the implementation?
> >>>>>>
> >>>>>> *APE*
> >>>>>> https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions
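
For readers following the thread, the query_schema() idea (an aggregate that folds every record of a subquery into one JSON Schema) can be illustrated with a minimal Python sketch. This is not the AsterixDB implementation. The names infer, merge, local_aggregate and global_aggregate are hypothetical, and the merge rules are assumptions chosen for illustration: a field missing from some records drops out of "required", and conflicting types become "anyOf". The local/global split only mirrors the general shape of a partitioned aggregate, where each partition builds a partial result and the partials are combined at the end.

```python
import json

def infer(value):
    """Return a minimal JSON Schema fragment describing one JSON value."""
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        items = None
        for element in value:
            items = merge(items, infer(element))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer(v) for k, v in value.items()},
            "required": sorted(value),
        }
    raise TypeError(f"unsupported value: {value!r}")

def _alternatives(schema):
    """View a schema as a list of per-type alternatives."""
    return schema["anyOf"] if "anyOf" in schema else [schema]

def _merge_same_type(a, b):
    """Merge two fragments that share the same "type" (illustrative rules)."""
    if a["type"] == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge(props.get(key), sub)
        # Assumption: a field is required only if every record had it.
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if a["type"] == "array":
        items = merge(a.get("items"), b.get("items"))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    return a  # same primitive type on both sides

def merge(a, b):
    """Combine two schemas; None acts as the empty (identity) schema."""
    if a is None:
        return b
    if b is None:
        return a
    by_type = {}
    for alt in _alternatives(a) + _alternatives(b):
        t = alt["type"]
        by_type[t] = _merge_same_type(by_type[t], alt) if t in by_type else alt
    alts = [by_type[t] for t in sorted(by_type)]
    return alts[0] if len(alts) == 1 else {"anyOf": alts}

def local_aggregate(records):
    """Per-partition step: fold records into one partial schema."""
    schema = None
    for record in records:
        schema = merge(schema, infer(record))
    return schema

def global_aggregate(partials):
    """Final step: combine the partial schemas from all partitions."""
    schema = None
    for partial in partials:
        schema = merge(schema, partial)
    return schema

# Two made-up partitions of made-up records.
partition_a = [{"id": 1, "login": "a"}, {"id": 2}]
partition_b = [{"id": "3", "login": "c", "tags": ["x"]}]

schema = global_aggregate([local_aggregate(partition_a), local_aggregate(partition_b)])
print(json.dumps(schema, indent=2))
```

In this sketch, merging the partial schemas gives the same result as one pass over all records, which is the property a partitioned aggregate needs from its combine step.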