Question - I think you were doing some perf testing - do you have perf
results for these (vs. the current schema function)?
On 12/5/24 12:04 PM, Calvin Dani wrote:
Hi,
Wanted to share an update regarding the features in the APE. The two
queries:
1. query_schema()
2. collection_schema()
are now functional. The query_schema() implementation has been submitted
for review. Once that is approved, I will proceed to submit the
collection_schema() query, as it depends on the first query's code.
I would greatly appreciate your feedback, additional test cases, and any
thoughts you have on this APE. I’m eager to refine it further or, if it
seems like a solid starting point, to receive approval for this APE.
Thank you for your time and input!
Regards
Calvin Dani
On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani<calvinthomas.d...@gmail.com>
wrote:
Hi,
The APE has been updated with those changes!
Regards
Calvin Dani
On Fri, Nov 1, 2024 at 10:36 AM Mike Carey<dtab...@gmail.com> wrote:
Excellent! +1
On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani<calvinthomas.d...@gmail.com>
wrote:
Hi,
Thank you for the feedback. As per the last meeting, the following changes
have been incorporated into this APE:
1. Name of the schema inference functions
2. Schema inference functionality
The summary of changes is as follows:
1. query_schema (an aggregate function that takes all records of the
subquery and generates a JSON Schema)
2. collection_schema (a JSON Schema translation of the datatypes defined
in the metadata node)
3. current_schema (for columnar stores; converts the schema inferred for
storage compaction to JSON Schema)
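To make the second function concrete, here is a rough Python sketch (not
the APE's implementation; the field names and type mapping are my own
assumptions) of what collection_schema conceptually does: translating a
datatype declared in the metadata into a JSON Schema object, without
touching the data itself.

```python
# Hypothetical sketch of collection_schema's translation step.
# The primitive-type mapping below is illustrative, not AsterixDB's actual one.
TYPE_MAP = {
    "int64": "integer",
    "double": "number",
    "string": "string",
    "boolean": "boolean",
}

def datatype_to_json_schema(declared_fields):
    """Translate {field_name: declared_type} into a JSON Schema object."""
    return {
        "type": "object",
        "properties": {
            name: {"type": TYPE_MAP[atype]}
            for name, atype in declared_fields.items()
        },
        # Declared (closed) fields are treated as required here.
        "required": sorted(declared_fields),
    }

schema = datatype_to_json_schema({"id": "int64", "name": "string"})
```

Because this reads only declared metadata, it needs no scan of the
collection, which is what distinguishes it from query_schema.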
Regards
Calvin Dani
On Fri, Oct 4, 2024 at 10:28 AM Mike Carey<dtab...@gmail.com> wrote:
Great feature! I wasn't able to understand the query example(s),
though... Could those be cleaned up a little and clarified?
Also, I think we might want two functions at the user level - one that
takes an expression as input and reports its schema, and another that
takes a dataset/collection name as input and reports its schema. The
first one would scan the results and say what the schema is; the other
would use a more efficient approach (accessing and combining the
metadata from the collection's most recent LSM components in each of its
partitions).
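My reading of the metadata-combining idea, as a minimal Python sketch
(all names here are hypothetical, not AsterixDB APIs): each partition
contributes the schema recorded in its most recent LSM components, and
the function unions those schemas instead of scanning the data - a field
stays required only if every partition requires it, and observed types
are unioned.

```python
def merge_schemas(partition_schemas):
    """Union per-partition JSON Schemas without scanning records.

    A field is required only if it is required in every partition;
    property types are unioned across partitions.
    """
    properties = {}
    required = None
    for schema in partition_schemas:
        for name, prop in schema.get("properties", {}).items():
            entry = properties.setdefault(name, {"type": set()})
            t = prop["type"]
            entry["type"].update(t if isinstance(t, list) else [t])
        req = set(schema.get("required", []))
        required = req if required is None else required & req
    return {
        "type": "object",
        "properties": {
            name: {"type": sorted(p["type"])}
            for name, p in properties.items()
        },
        "required": sorted(required or []),
    }

merged = merge_schemas([
    {"properties": {"id": {"type": "integer"}}, "required": ["id"]},
    {"properties": {"id": {"type": "integer"},
                    "tag": {"type": "string"}}, "required": ["id", "tag"]},
])
```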
Cheers,
Mike
On 10/4/24 10:13 AM, Calvin Dani wrote:
Initiating the discussion thread proposing a new aggregate function in
AsterixDB.
*Feature:* aggregate function to infer schema
*Details:* This feature introduces schema inference as an SQL++ function
directly integrated into AsterixDB. It is the first approach to offer
schema inference as a native SQL++ function, allowing users to infer
schemas not only for any dataset but also for queries and subqueries.
Its output, in JSON Schema (the industry standard), is both human- and
machine-readable, suitable for user interpretation or for integration
into other queries or programs.
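As a loose illustration of the kind of fold such an aggregate performs
(a minimal sketch under my own assumptions, not the APE's
implementation), one can walk the records, collect each field's observed
JSON Schema types, and mark a field required only when every record
contains it:

```python
# Illustrative only: infer a JSON Schema from a list of dict "records".
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def infer_schema(records):
    """Fold over records, unioning each field's observed types."""
    types = {}    # field name -> set of JSON Schema type names
    seen_in = {}  # field name -> number of records containing it
    for rec in records:
        for name, value in rec.items():
            types.setdefault(name, set()).add(PY_TO_JSON[type(value)])
            seen_in[name] = seen_in.get(name, 0) + 1
    n = len(records)
    return {
        "type": "object",
        "properties": {k: {"type": sorted(v)} for k, v in types.items()},
        "required": sorted(k for k, c in seen_in.items() if c == n),
    }

inferred = infer_schema([
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "score": 3.5},
])
```

The "score" field appears in only one record, so it is inferred as
optional while "id" and "name" remain required.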
array_schema() was implemented using the template of array_avg() in the
Built-in Function and Function Collection files. During self review, I
noticed that many of the defined aggregate functions, for
example SerializableAvgAggregateFunction
and IntermediateAvgAggregateFunction, are not being invoked during an
array_schema() query. Is that due to different use cases, or am I using
them incorrectly?
Are there any resources explaining the functionality of aggregate
functions in the implementation?
*APE*
https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions