Re: Implement an SerializableVector in Hyracks

Jianfeng Jia Mon, 01 Feb 2016 17:08:57 -0800

If I understand correctly, it seems very similar to the IFrame in Hyrack, which 
is also a container to store a sequence of record into the ByteBuffer.


I’m not clear about how records are manipulated inside the ValueVectors.
In Hyracks case, we store the pointer(usually a pair of (frameID, recordID)) in 
one java array and manipulate the pointers instead of the original records. We 
mainly want to break the that one array into multiple small ByteBuffer-based 
arrays. By doing so, we can reduce the risk of OOM for a large array and we may 
also take the memory usage of those pointers into account for the flush 
decisions.  @Ted, could you share some insights about how the ValueVectors 
handles manipulations? e.g. sort, hashing … etc.  

> On Feb 1, 2016, at 3:16 PM, Ted Dunning <[email protected]> wrote:
> 
> Have you guys looked at the Drill ValueVectors?
> 
> This structure is being spun out as Apache Arrow with multiple interfaces
> and language bindings.
> 
> On Mon, Feb 1, 2016 at 9:56 AM, Jianfeng Jia <[email protected]> wrote:
> 
>> Hi,
>> 
>> We plan to implement an append-only array at first. The main reason is
>> this is how the auxiliary data structure be used so far. Then the
>> implementation is straightforward.
>> 
>> The tree-structured Vector in Scala can save a lot in updating case mainly
>> because of their immutable requirement. It can saving unnecessary data
>> copies comparing to other immutable list when updating. We only allow
>> in-place update. The tree design may be overkill for us.
>> 
>> Xi made on detailed design doc is here:
>> https://docs.google.com/document/d/1bs3JBCxmvuJZmBq_gKzUDt0FZdM9j3d3MPAmC29ZoSA/edit?usp=sharing
>> Any thoughts or comments?
>> 
>> 
>>> On Jan 18, 2016, at 9:52 AM, Chen Li <[email protected]> wrote:
>>> 
>>> @Xi and Jianfeng: after we come up with the design, let's share it with
>> the
>>> group for an approval before the implementation.
>>> 
>>> Chen
>>> 
>>> On Fri, Jan 15, 2016 at 11:48 AM, Mike Carey <[email protected]> wrote:
>>> 
>>>> The accounting is just as critical as the chunking - we should do both
>>>> (together).
>>>> 
>>>> 
>>>> On 1/15/16 9:00 AM, Till Westmann wrote:
>>>> 
>>>>> I don’t have relevant experience on the subject. But I think that it
>>>>> sounds good to avoid arbitrarily long chunks of memory. Especially - as
>>>>> Jianfeng wrote - it would be good to be able to a) account for this
>> memory
>>>>> and b) to manage it.
>>>>> An interesting question for me would be what the overhead of such a
>>>>> Vector is compared to a simple Java array and as a result where it
>> should
>>>>> be used to replace arrays. (The comparison in [3] only compares
>> different
>>>>> Scala collections, but doesn’t look at plain arrays.)
>>>>> 
>>>>> Cheers,
>>>>> Till
>>>>> 
>>>>> On 14 Jan 2016, at 22:05, Chen Li wrote:
>>>>> 
>>>>> Before we ask Xi to work on this project, it will be good to know if
>>>>>> other people have seen similar problems and agree with this plan.
>>>>>> @Till: can you share some tips?
>>>>>> 
>>>>>> Chen
>>>>>> 
>>>>>> On Wed, Jan 13, 2016 at 4:27 PM, Jianfeng Jia <[email protected]
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Devs,
>>>>>>> 
>>>>>>> First of all, Xi Zhang is a Master student at UCI wants to work with
>> us
>>>>>>> for a while. Welcome Xi!
>>>>>>> 
>>>>>>> We are thinking of making a Frame-based, memory-bound
>>>>>>> SerializableVector at first. We expect this vector can solve some
>>>>>>> occasionally Java.Heap.OutOfMemory exceptions in Hyracks.
>>>>>>> Though we did a good job on organizing the record-located memory, the
>>>>>>> OOM exception can still happen while operating the auxiliary data
>>>>>>> structure. For example in the sort run generator, instead of moving
>> record
>>>>>>> around we are creating an reference “pointer" array to the original
>> record.
>>>>>>> However, if the record is small and the size of that int array will
>> be
>>>>>>> large, then the OOM exception will occur, which is the case of issue
>> [1].
>>>>>>> 
>>>>>>> One way to solve this problem is to put auxiliary data structures
>> into
>>>>>>> the memory-bounded frame as well. In general, it will be much easier
>> to ask
>>>>>>> for multiple small memory blocks than one big chunk of memory. I
>> guess that
>>>>>>> was the same reason why we have “SerializableHashTable” for
>> HashJoin. It
>>>>>>> will be nice to have a more general structure that can be used by
>> all the
>>>>>>> operators.
>>>>>>> 
>>>>>>> The Frame based Vector idea is inspired by the Scala Vector[2] which
>>>>>>> looks like a List, but internally it is implemented as a 32-ary
>> tree. The
>>>>>>> performance of it is very stable for variety size of object[3]. It
>> will
>>>>>>> have all the benefits of ArrayList and the LinkedList. In addition,
>> we can
>>>>>>> take the memory usage of the auxiliary structure into the
>> calculation. We
>>>>>>> will work on the detailed design doc later if we are agree on this
>>>>>>> direction.
>>>>>>> 
>>>>>>> Any thoughts or suggestions? Thank you!
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>>>>>> <
>>>>>>> 
>> https://code.google.com/p/asterixdb/issues/detail?id=934&can=1&q=last%20straw&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20ETA%20Severity
>>> 
>>>>>>> 
>>>>>>> [2] https://bitbucket.org/astrieanna/bitmapped-vector-trie <
>>>>>>> https://bitbucket.org/astrieanna/bitmapped-vector-trie>
>>>>>>> [3]
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/
>>>>>>> <
>>>>>>> 
>> http://danielasfregola.com/2015/06/15/which-immutable-scala-collection/>
>>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Jianfeng Jia
>>>>>>> PhD Candidate of Computer Science
>>>>>>> University of California, Irvine
>>>>>>> 
>>>>>>> 
>>>> 
>> 
>> 
>> 
>> Best,
>> 
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
>> 
>> 



Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine

Re: Implement an SerializableVector in Hyracks

Reply via email to