Oops. I received Jacques' e-mail after sending mine. I totally agree that
the word "record" is dangerous  :-o

-H+

On Fri, Feb 27, 2015 at 10:52 AM, Hanifi Gunes <[email protected]> wrote:

> I might be wrong but considering that ValueVector roughly refers to a
> value container, I would think that the word value should be consistently
> used to refer to the top level child element that is stored in any vector
> regardless of whether the vector is repeated, composite or flat.
>
> I think it is important to note that the concept of groupings applies to
> multi-level repeated types in which case, each value naturally represents a
> group. If an external party knows that he is working on a multi-level
> repeated type, then he knows for sure that each individual value is itself
> a group. So I think the word grouping is not needed anyway.
>
>
> @Jacques
>
> - I think value is the problem word.  I'm not sure it is better for groupings
> or cells in the case of repeated types.  What do they use in Parquet?
> Parquet naming conventions are not clear to me either. It relies on
> rowCount at the block level and valueCount at the column level. Not sure
> about nested types.
>
> - I'd also like to see this proposal in the context of a larger proposed
> design spec for that jira.
> I am working on a more formal proposal. I will open the draft for
> community feedback once it is in a good shape.
>
>
> @Jason
>
> I am always in support of finding better names. However, I would think
> that getChildCount is misleading too since, as I described above, the child
> of a vector is a value. If we are not going to coin our own terminology, such
> as stating that each value consists of individual cells (or a better name
> here), I would suggest being more explicit about naming.
>
> - (excerpt) Even beyond the issue of repeated confusion, this number also
> currently includes nulls, which some devs might find confusing if we
> don't document it.
> Good point. The broad proposal is to provide documentation alongside
> design refactoring.
>
>
> Regards.
> -Hanifi
>
> On Fri, Feb 27, 2015 at 8:16 AM, Jason Altekruse <[email protected]> wrote:
>
>> Hanifi,
>>
>> I think we should try to avoid using the word 'cell' to refer to elements
>> within a single value. We often explain the concept of complex data in
>> Drill by describing a list or map type being stored in a single database
>> 'cell'. Overall, I totally agree about the lack of clarity. I would advocate
>> for something like getChildCount for the number of members below the lists.
>> As current database language does not include hierarchies/nesting, I think
>> this is a safe naming convention.
>>
>> In response to Jacques' comments, we might be at a loss trying to unify
>> the concepts of individual values in the case of scalar vectors and entire
>> lists/nested structures with a simple name change. It might just be
>> clearest to document the getValueCount method at the top-level value vector
>> interface to clearly state that it should match the number of records. Even
>> beyond the issue of repeated confusion, this number also currently includes
>> nulls, which some devs might find confusing if we don't document it.
>>
>> -Jason
>>
>> On Fri, Feb 27, 2015 at 6:24 AM, Jacques Nadeau <[email protected]>
>> wrote:
>>
>> > I think value is the problem word.  I'm not sure it is better for groupings
>> > or cells in the case of repeated types.  What do they use in Parquet?
>> >
>> > I'd also like to see this proposal in the context of a larger proposed
>> > design spec for that jira.
>> > On Feb 26, 2015 5:52 PM, "Hanifi Gunes" <[email protected]> wrote:
>> >
>> > > Hey everyone,
>> > >
>> > > Scalar ValueVector (VV) types implement the getValueCount method, which
>> > > returns the number of "value"s stored in the vector. I would expect the
>> > > same to be true for RepeatedVVs as well. However, getValueCount on
>> > > repeated types reports the number of inner/sub-values stored, and these
>> > > types introduce another method called groupCount to report the actual
>> > > number of "value"s stored.
>> > >
>> > > This becomes really confusing and somewhat inconsistent (especially for
>> > > RepeatedList), as one would expect #getValueCount to report the number
>> > > of values regardless of whether the stored value type is nested or flat.
>> > >
>> > > As part of DRILL-2150, I am refactoring VVs so that getValueCount
>> > > universally returns the number of values stored. Alongside, I plan to
>> > > introduce a new method getCellCount that reports the total number of
>> > > sub-values/cells stored in a repeated vector.
>> > >
>> > > I'd like to probe if anyone has any concerns relating to this. Please let
>> > > me know.
>> > >
>> > >
>> > > Thanks.
>> > > -Hanifi
>> > >
>> >
>>
>
>

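For readers following the thread, here is a minimal sketch of the counting
semantics proposed in the original message. It assumes a hypothetical toy
class, ToyRepeatedIntVector, which is not Drill's actual RepeatedVV
implementation; only the method names getValueCount and getCellCount come
from the thread. An offsets list marks where each top-level value starts, so
getValueCount reports the number of top-level values (empty ones included),
while getCellCount reports the total number of inner elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class for illustration only; not Drill's RepeatedVV code.
public class ToyRepeatedIntVector {
    // offsets.get(i) is the index in `cells` where the i-th top-level value
    // starts; the last entry always equals the total number of cells.
    private final List<Integer> offsets = new ArrayList<>();
    private final List<Integer> cells = new ArrayList<>();

    public ToyRepeatedIntVector() {
        offsets.add(0);
    }

    // Appends one top-level value: a list of ints, possibly empty.
    public void add(int... elements) {
        for (int e : elements) {
            cells.add(e);
        }
        offsets.add(cells.size());
    }

    // Number of top-level values, regardless of how many cells each holds.
    public int getValueCount() {
        return offsets.size() - 1;
    }

    // Total number of inner elements across all top-level values.
    public int getCellCount() {
        return cells.size();
    }

    public static void main(String[] args) {
        ToyRepeatedIntVector v = new ToyRepeatedIntVector();
        v.add(1, 2, 3);
        v.add();          // an empty list still counts as one value
        v.add(4, 5);
        System.out.println(v.getValueCount()); // prints 3
        System.out.println(v.getCellCount());  // prints 5
    }
}
```

In this sketch a flat (scalar) vector would be the special case where every
value holds exactly one cell, so the two counts coincide; that is one way to
read the consistency argument made above.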