If I understand correctly, it seems very similar to the IFrame in Hyrack, which is also a container to store a sequence of record into the ByteBuffer.
I’m not clear about how records are manipulated inside the ValueVectors. In Hyracks case, we store the pointer(usually a pair of (frameID, recordID)) in one java array and manipulate the pointers instead of the original records. We mainly want to break the that one array into multiple small ByteBuffer-based arrays. By doing so, we can reduce the risk of OOM for a large array and we may also take the memory usage of those pointers into account for the flush decisions. @Ted, could you share some insights about how the ValueVectors handles manipulations? e.g. sort, hashing … etc. > On Feb 1, 2016, at 3:16 PM, Ted Dunning <[email protected]> wrote: > > Have you guys looked at the Drill ValueVectors? > > This structure is being spun out as Apache Arrow with multiple interfaces > and language bindings. > > On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <[email protected]> wrote: > >> Hi, >> >> We plan to implement an append-only array at first. The main reason is >> this is how the auxiliary data structure be used so far. Then the >> implementation is straightforward. >> >> The tree-structured Vector in Scala can save a lot in updating case mainly >> because of their immutable requirement. It can saving unnecessary data >> copies comparing to other immutable list when updating. We only allow >> in-place update. The tree design may be overkill for us. >> >> Xi made on detailed design doc is here: >> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing >> Any thoughts or comments? >> >> >>> On Jan 18, 2016, at 9:52 AM, Chen Li <[email protected]> wrote: >>> >>> @Xi and Jianfeng: after we come up with the design, let's share it with >> the >>> group for an approval before the implementation. >>> >>> Chen >>> >>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <[email protected]> wrote: >>> >>>> The accounting is just as critical as the chunking - we should do both >>>> (together). >>>> >>>> >>>> On 1/15/16 9:00 AM, Till Westmann wrote: >>>> >>>>> I don’t have relevant experience on the subject. But I think that it >>>>> sounds good to avoid arbitrarily long chunks of memory. Especially - as >>>>> Jianfeng wrote - it would be good to be able to a) account for this >> memory >>>>> and b) to manage it. >>>>> An interesting question for me would be what the overhead of such a >>>>> Vector is compared to a simple Java array and as a result where it >> should >>>>> be used to replace arrays. (The comparison in [3] only compares >> different >>>>> Scala collections, but doesn’t look at plain arrays.) >>>>> >>>>> Cheers, >>>>> Till >>>>> >>>>> On 14 Jan 2016, at 22:05, Chen Li wrote: >>>>> >>>>> Before we ask Xi to work on this project, it will be good to know if >>>>>> other people have seen similar problems and agree with this plan. >>>>>> @Till: can you share some tips? >>>>>> >>>>>> Chen >>>>>> >>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <[email protected] >>> >>>>>> wrote: >>>>>> >>>>>>> Hi Devs, >>>>>>> >>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work with >> us >>>>>>> for a while. Welcome Xi! >>>>>>> >>>>>>> We are thinking of making a Frame-based, memory-bound >>>>>>> SerializableVector at first. We expect this vector can solve some >>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks. >>>>>>> Though we did a good job on organizing the record-located memory, the >>>>>>> OOM exception can still happen while operating the auxiliary data >>>>>>> structure. For example in the sort run generator, instead of moving >> record >>>>>>> around we are creating an reference “pointer" array to the original >> record. >>>>>>> However, if the record is small and the size of that int array will >> be >>>>>>> large, then the OOM exception will occur, which is the case of issue >> [1]. >>>>>>> >>>>>>> One way to solve this problem is to put auxiliary data structures >> into >>>>>>> the memory-bounded frame as well. In general, it will be much easier >> to ask >>>>>>> for multiple small memory blocks than one big chunk of memory. I >> guess that >>>>>>> was the same reason why we have “SerializableHashTable” for >> HashJoin. It >>>>>>> will be nice to have a more general structure that can be used by >> all the >>>>>>> operators. >>>>>>> >>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which >>>>>>> looks like a List, but internally it is implemented as a 32-ary >> tree. The >>>>>>> performance of it is very stable for variety size of object[3]. It >> will >>>>>>> have all the benefits of ArrayList and the LinkedList. In addition, >> we can >>>>>>> take the memory usage of the auxiliary structure into the >> calculation. We >>>>>>> will work on the detailed design doc later if we are agree on this >>>>>>> direction. >>>>>>> >>>>>>> Any thoughts or suggestions? Thank you! >>>>>>> >>>>>>> >>>>>>> [1] >>>>>>> >> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity >>>>>>> < >>>>>>> >> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity >>> >>>>>>> >>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie < >>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie> >>>>>>> [3] >>>>>>> >> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/ >>>>>>> < >>>>>>> >> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/> >>>>>>> >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jianfeng Jia >>>>>>> PhD Candidate of Computer Science >>>>>>> University of California, Irvine >>>>>>> >>>>>>> >>>> >> >> >> >> Best, >> >> Jianfeng Jia >> PhD Candidate of Computer Science >> University of California, Irvine >> >> Best, Jianfeng Jia PhD Candidate of Computer Science University of California, Irvine
