Oops. I received Jacques' e-mail after sending mine. I totally agree that the word "record" is dangerous :-o
-H+

On Fri, Feb 27, 2015 at 10:52 AM, Hanifi Gunes <[email protected]> wrote:

> I might be wrong, but considering that ValueVector roughly refers to a
> value container, I would think that the word "value" should be used
> consistently to refer to the top-level child element stored in any
> vector, regardless of whether the vector is repeated, composite or flat.
>
> I think it is important to note that the concept of groupings applies
> to multi-level repeated types, in which case each value naturally
> represents a group. If an external party knows that they are working on
> a multi-level repeated type, then they know for sure that each
> individual value is by itself a group. So I think the word "grouping"
> is not needed anyway.
>
>
> @Jacques
>
> - I think value is the problem word. I'm not sure it is better for
>   groupings or cells in the case of repeated types. What do they use in
>   Parquet?
>
> Parquet naming conventions are not clear to me either. It relies on
> rowCount at the block level and valueCount at the column level. Not sure
> about nested types.
>
> - I'd also like to see this proposal in the context of a larger proposed
>   design spec for that jira.
>
> I am working on a more formal proposal. I will open the draft for
> community feedback once it is in good shape.
>
>
> @Jason
>
> I am always in support of finding better names. However, I would think
> that getChildCount is misleading too since, as I described above, the
> child of a vector is a value. If we are not going to coin our own
> terminology, such as stating that each value consists of individual
> cells (or a better name here), I would suggest being more explicit
> about naming.
>
> - (excerpt) Even beyond the issue of repeated confusion, this number
>   also currently includes nulls, which some devs might find confusing if
>   we don't document it.
>
> Good point. The broad proposal is to provide documentation alongside
> design refactoring.
>
>
> Regards.
> -Hanifi
>
> On Fri, Feb 27, 2015 at 8:16 AM, Jason Altekruse <[email protected]>
> wrote:
>
>> Hanifi,
>>
>> I think we should try to avoid using the word 'cell' to refer to
>> elements within a single value. We often explain the concept of complex
>> data in Drill by describing a list or map type being stored in a single
>> database 'cell'. Overall I totally agree with the lack of clarity. I
>> would advocate for something like getChildCount for the number of
>> members below the lists. As current database language does not include
>> hierarchies/nesting, I think this is a safe naming convention.
>>
>> In response to Jacques' comments, we might be at a loss trying to unify
>> the concepts of individual values in the case of scalar vectors and
>> entire lists/nested structures with a simple name change. It might just
>> be clearest to document the getValueCount method at the top-level value
>> vector interface to clearly state that it should match the number of
>> records. Even beyond the issue of repeated confusion, this number also
>> currently includes nulls, which some devs might find confusing if we
>> don't document it.
>>
>> -Jason
>>
>> On Fri, Feb 27, 2015 at 6:24 AM, Jacques Nadeau <[email protected]>
>> wrote:
>>
>> > I think value is the problem word. I'm not sure it is better for
>> > groupings or cells in the case of repeated types. What do they use in
>> > Parquet?
>> >
>> > I'd also like to see this proposal in the context of a larger
>> > proposed design spec for that jira.
>> > On Feb 26, 2015 5:52 PM, "Hanifi Gunes" <[email protected]> wrote:
>> >
>> > > Hey everyone,
>> > >
>> > > Scalar ValueVector (VV) types implement the getValueCount method,
>> > > which returns the number of "value"s stored in the vector. I would
>> > > expect the same to be true for RepeatedVVs as well. However,
>> > > getValueCount on repeated types reports the number of inner/sub-values
>> > > stored, and these types introduce another method called groupCount
>> > > to report the actual number of "value"s stored.
>> > >
>> > > This becomes really confusing and somewhat inconsistent (especially
>> > > for RepeatedList), as one would expect #getValueCount to report the
>> > > number of values regardless of whether the stored value type is
>> > > nested or flat.
>> > >
>> > > As part of DRILL-2150, I am refactoring VVs so that getValueCount
>> > > universally returns the number of values stored. Alongside, I plan
>> > > to introduce a new method, getCellCount, that reports the total
>> > > number of sub-values/cells stored in a repeated vector.
>> > >
>> > > I'd like to probe if anyone has any concerns relating to this.
>> > > Please let me know.
>> > >
>> > >
>> > > Thanks.
>> > > -Hanifi
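
To make the distinction debated in this thread concrete, here is a minimal Java sketch of a repeated int vector that reports both counts. It is only an illustration under stated assumptions: the class RepeatedIntSketch and its offsets-plus-cells layout are hypothetical and are not Drill's actual ValueVector classes, and getCellCount is merely the name floated in the proposal, not an agreed API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a repeated int vector; not Drill's real class. */
final class RepeatedIntSketch {
    // offsets.get(i) .. offsets.get(i + 1) delimits the cells of value i.
    private final List<Integer> offsets = new ArrayList<>();
    // All inner elements of all values, stored back to back.
    private final List<Integer> cells = new ArrayList<>();

    RepeatedIntSketch() {
        offsets.add(0); // start offset of the first value
    }

    /** Appends one top-level value, i.e. a (possibly empty) list of cells. */
    void add(int... elements) {
        for (int e : elements) {
            cells.add(e);
        }
        offsets.add(cells.size());
    }

    /** Number of top-level values: one per record, whatever its length. */
    int getValueCount() {
        return offsets.size() - 1;
    }

    /** Total number of inner cells across all values (proposed name). */
    int getCellCount() {
        return cells.size();
    }

    public static void main(String[] args) {
        RepeatedIntSketch v = new RepeatedIntSketch();
        v.add(1, 2, 3); // record 0: three cells
        v.add();        // record 1: empty list, still one value
        v.add(4, 5);    // record 2: two cells
        System.out.println(v.getValueCount()); // prints 3
        System.out.println(v.getCellCount());  // prints 5
    }
}
```

In this sketch an empty list still counts as one value, so getValueCount tracks the number of records while getCellCount tracks the inner elements.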
