Nice!
On 12/13/24 4:32 PM, Calvin Dani wrote:
Hi,
Regarding the performance testing of the first query for schema
inference:
We benchmarked it against contemporary methods, primarily Spark-based
implementations, using a configuration of 2 node controllers and 8
data partitions.
For a GitHub dataset of 51GB:
Our approach inferred the schema in 51.6 seconds,
Spark’s native implementation took 81.6 seconds,
Methods by Spoth and Mior required 400+ seconds.
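For scale, these timings work out to roughly a 1.58x speedup over Spark's native implementation and at least 7.75x over the Spoth and Mior methods (whose time is reported only as "400+"). A minimal Python sketch of that arithmetic follows; it assumes 51GB means 51 decimal gigabytes and uses 400 s as a lower bound, and the helper names are illustrative only:

```python
# Timings reported above (seconds) for the 51 GB GitHub dataset.
DATASET_GB = 51.0  # assumption: decimal gigabytes
timings_s = {
    "this APE": 51.6,
    "Spark native": 81.6,
    "Spoth / Mior (lower bound)": 400.0,  # reported only as "400+"
}

BASELINE_S = timings_s["this APE"]


def speedup(other_s, baseline_s=BASELINE_S):
    """How many times longer `other_s` is than the baseline run."""
    return other_s / baseline_s


def throughput_gb_per_s(seconds, size_gb=DATASET_GB):
    """Average scan rate implied by a wall-clock time."""
    return size_gb / seconds


for name, secs in timings_s.items():
    print(f"{name:28s} {secs:6.1f} s  "
          f"{throughput_gb_per_s(secs):.2f} GB/s  "
          f"{speedup(secs):.2f}x baseline time")
```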
I hope this is helpful.
Regards
Calvin Dani
On Thu, Dec 12, 2024 at 5:27 AM Mike Carey <dtab...@gmail.com> wrote:
Question - I think you were doing some perf testing - do you have perf
results for these (vs. the current schema function)?
On 12/5/24 12:04 PM, Calvin Dani wrote:
> Hi,
>
> Wanted to share an update regarding the features in the APE. The two
> queries:
>
> 1. query_schema()
>
> 2. collection_schema()
>
> are now functional. The query_schema() implementation has been submitted
> for review. Once that is approved, I will proceed to submit the
> collection_schema() query, as it depends on the first query's code.
>
> I would greatly appreciate your feedback, additional test cases, and any
> thoughts you have on this APE. I’m eager to refine it further or, if it
> seems like a solid starting point, to receive approval for this APE.
>
> Thank you for your time and input!
>
> Regards
>
> Calvin Dani
>
> On Wed, Nov 6, 2024 at 4:06 PM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
>
>> Hi,
>>
>> The APE has been updated with those changes!
>>
>> Regards
>> Calvin Dani
>>
>> On Fri, Nov 1, 2024 at 10:36 AM Mike Carey <dtab...@gmail.com> wrote:
>>
>>> Excellent! +1
>>>
>>> On Fri, Nov 1, 2024 at 9:35 AM Calvin Dani <calvinthomas.d...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thank you for the feedback. As per the last meeting, here are the
>>>> changes that are incorporated into this APE.
>>>> They are as follows:
>>>> 1. Name of the schema inference functions
>>>> 2. Schema inference functionality
>>>>
>>>> The summary of changes is as follows:
>>>>
>>>> 1. query_schema (aggregate function that takes all records of the
>>>> subquery and generates a JSON Schema)
>>>> 2. collection_schema (JSON Schema translation of the defined datatypes
>>>> in the metadata node)
>>>> 3. current_schema (for columnar stores and converting the inferred
>>>> schema for storage compaction to JSON Schema)
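To make the first item concrete, the idea of an aggregate that folds every record into one JSON Schema can be sketched in Python. This is only an illustration under stated assumptions, not the APE's implementation: the names `infer`, `merge`, and `fold_schema` are hypothetical, conflicting types are represented with `anyOf`, and a field is listed as `required` only if every record carries it.

```python
def json_type(value):
    """Map a Python value to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def infer(value):
    """Schema of a single value (recursing into objects and arrays)."""
    t = json_type(value)
    if t == "object":
        return {"type": "object",
                "properties": {k: infer(v) for k, v in value.items()},
                "required": sorted(value)}
    if t == "array":
        items = None
        for item in value:
            items = merge(items, infer(item))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    return {"type": t}


def _alternatives(schema):
    return schema["anyOf"] if "anyOf" in schema else [schema]


def _merge_same_type(a, b):
    if a["type"] == "object":
        props = dict(a["properties"])
        for key, sub in b["properties"].items():
            props[key] = merge(props.get(key), sub)
        required = sorted(set(a["required"]) & set(b["required"]))
        return {"type": "object", "properties": props, "required": required}
    if a["type"] == "array":
        items = merge(a.get("items"), b.get("items"))
        return {"type": "array"} if items is None else {"type": "array", "items": items}
    return a  # identical scalar types


def merge(a, b):
    """Combine two schemas; keeps at most one alternative per JSON type."""
    if a is None:
        return b
    if b is None:
        return a
    merged = []
    for alt in _alternatives(a) + _alternatives(b):
        for i, seen in enumerate(merged):
            if seen["type"] == alt["type"]:
                merged[i] = _merge_same_type(seen, alt)
                break
        else:
            merged.append(alt)
    return merged[0] if len(merged) == 1 else {"anyOf": merged}


def fold_schema(records):
    """Aggregate step: fold all records into a single schema."""
    schema = None
    for record in records:
        schema = merge(schema, infer(record))
    return schema
```

For example, `fold_schema([{"id": 1, "name": "a"}, {"id": "2"}])` reports `id` as either an integer or a string and keeps only `id` in `required`, since `name` is missing from the second record.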
>>>>
>>>>
>>>> Regards
>>>> Calvin Dani
>>>>
>>>>
>>>> On Fri, Oct 4, 2024 at 10:28 AM Mike Carey <dtab...@gmail.com> wrote:
>>>>
>>>>> Great feature! I wasn't able to understand the query example(s),
>>>>> though... Could those be cleaned up a little and clarified?
>>>>>
>>>>> Also, I think we might want two functions at the user level - one that
>>>>> takes an expression as input and reports its schema, and another that
>>>>> takes a dataset/collection name as input and reports its schema. The
>>>>> first one would scan the results and say what the schema is; the other
>>>>> would use a more efficient approach (accessing and combining the
>>>>> metadata from the collection's most recent LSM components in each of its
>>>>> partitions).
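The second, metadata-based approach can be sketched as a merge over per-partition summaries rather than over records. The sketch below is a Python illustration under a loud assumption: that each partition's most recent LSM component exposes a flat map from field path to the set of type names seen there. That format and the names `combine` and `collection_summary` are hypothetical and are not AsterixDB's actual metadata layout.

```python
from functools import reduce


def combine(left, right):
    """Union two partition summaries (field path -> set of type names)."""
    out = {path: set(types) for path, types in left.items()}
    for path, types in right.items():
        out.setdefault(path, set()).update(types)
    return out


def collection_summary(partition_summaries):
    """Combine one summary per partition; no records are scanned."""
    return reduce(combine, partition_summaries, {})


# Hypothetical summaries, one per data partition.
partitions = [
    {"id": {"integer"}, "name": {"string"}},
    {"id": {"integer", "string"}, "tags[]": {"string"}},
]
summary = collection_summary(partitions)
```

The work here grows with the number of partitions and distinct fields, not with the number of stored records, which is why this path would be expected to be cheaper than scanning query results.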
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mike
>>>>>
>>>>> On 10/4/24 10:13 AM, Calvin Dani wrote:
>>>>>> Initiating the discussion thread proposing a new aggregate function in
>>>>>> AsterixDB.
>>>>>> *Feature:* aggregate function to infer schema
>>>>>> *Details:* This feature introduces schema inference as an SQL++ function
>>>>>> directly integrated into AsterixDB. It is the first approach to offer
>>>>>> schema inference as a native SQL++ function, allowing users to infer
>>>>>> schemas not only for any dataset but also for queries and subqueries.
>>>>>> Its output, in JSON Schema, the industry standard, is both human- and
>>>>>> machine-readable, suitable for user interpretation or integration
>>>>>> into other queries or programs.
>>>>>>
>>>>>> array_schema() was implemented using array_avg() in the Built-in
>>>>>> Function and Function collection file as a template. During self-review,
>>>>>> I noticed that many of the defined aggregate functions, for example
>>>>>> SerializableAvgAggregateFunction and IntermediateAvgAggregateFunction,
>>>>>> are not being utilised during an array_schema() query. Is this due to
>>>>>> different use cases, or am I utilising them incorrectly?
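As background for this question, one common reason a parallel engine carries several variants of the same aggregate is partitioned aggregation: each partition computes a partial result, partials may be combined in intermediate steps, and a final step produces the answer. The Python sketch below shows that generic pattern for an average. It is not a model of the AsterixDB classes named above, and whether this explains why they were not invoked for array_schema() is not established in this thread; the function names are illustrative only.

```python
def local_avg(values):
    """Per-partition step: reduce raw values to a (sum, count) partial."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def intermediate_avg(partials):
    """Optional combining step: merge several partials into one partial."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def global_avg(partials):
    """Final step: turn the combined partial into the average."""
    total, count = intermediate_avg(partials)
    return total / count if count else None


# Three hypothetical partitions, one of them empty.
partition_values = [[1, 2, 3], [4, 5], []]
partials = [local_avg(values) for values in partition_values]
result = global_avg(partials)
```

Averaging the per-partition averages directly would give the wrong answer when partitions differ in size, which is why the partial state carries both a sum and a count.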
>>>>>>
>>>>>> Are there any resources to understand the functionality of aggregate
>>>>>> functions in the implementation?
>>>>>>
>>>>>> *APE*
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+8%3A+Schema+Inference+Aggregate+Functions