Hi,

This paper (http://arxiv.org/pdf/1105.4252.pdf) compares a column-oriented layout (one file per column) with RCFile. They use skip lists and lazy record construction.

Cheers,
Karthik

On 14 September 2012 17:15, Amir Youssefi <[email protected]> wrote:

> "Nested data is not yet implemented" in BigQuery (if I recall exact words
> correctly). Quoting speaker at the BigQuery presentation at Google
> Technology User Group last week in Googleplex (intentionally not citing
> speaker's name).
>
> -ay
>
> On Sep 14, 2012, at 1:28 PM, David Gruzman <[email protected]> wrote:
>
> > I assume that evolution of BigQuery reflects resolution of Dremel... If
> > somebody has information on it, it would be great.
> > Storage system should understand that all files comprising the horizontal
> > partition of the table are one logical entity, and should store them
> > together / in some proximity. I agree that PAX will be much more
> > convenient. The question is - is there a performance penalty of PAX vs
> > file per column?
> > David
> >
> > On Fri, Sep 14, 2012 at 11:21 PM, Tomer Shiran <[email protected]> wrote:
> >
> >> Is there any public information suggesting that Google moved away from
> >> supporting nested data? Clearly BigQuery doesn't yet allow nested data,
> >> but not sure that applies to Dremel.
> >>
> >> There are challenges with one file per column. How do you ensure that a
> >> single record is located on a single machine to avoid costly record
> >> reconstruction?
> >>
> >> On Fri, Sep 14, 2012 at 1:05 PM, David Gruzman <[email protected]> wrote:
> >>
> >>> Hi All,
> >>> I would like to discuss the question of what will be native format for
> >>> Drill. Original Google Dremel paper defined their hierarchical columnar
> >>> data format. Since then
> >>> Google shifted from hierarchical data format... So it is a question if
> >>> it makes sense to stick with it?
> >>> If we are also moving to simple flat format we need our own format we
> >>> have to support "native". In case of Drill I would define that native
> >>> support as "high performance".
> >>> I think we can go to some kind of PAX format with comprehensive
> >>> metadata in the header, so each file is completely self contained and
> >>> can be understood and processed without any external data.
> >>> Alternative is to have single file per column. As far as I remember
> >>> from our OpenDremel work the main decision point is - if we can read
> >>> one column from the file without loading into node memory unnecessary
> >>> data from other columns.
> >>> With best regards,
> >>> David
> >>>
> >>
> >> --
> >> Tomer Shiran
> >> Director of Product Management | MapR Technologies | 650-804-8657
> >>
>
