Note we have https://issues.apache.org/jira/browse/ARROW-1705 (and
maybe some other JIRAs, I'd have to go digging) about improving
support for converting Python dicts to the right Arrow memory layout.

- Wes

On Mon, Jan 22, 2018 at 4:50 PM, simba nyatsanga <simnyatsa...@gmail.com> wrote:
> Hi Uwe,
>
> Thank you very much for the detailed explanation. I have a much better
> understanding now.
>
> Cheers
>
> On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn <uw...@xhochy.com> wrote:
>
>> Hello Simba,
>>
>> find the answers inline.
>>
>> On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
>> > Hi Everyone,
>> >
>> > I've got two questions that I'd like help with:
>> >
>> > 1. Pandas and numpy arrays can handle multiple types in a sequence,
>> > e.g. a float and a string, by using dtype=object. From what I
>> > gather, Arrow arrays enforce a uniform type depending on the type
>> > of the first encountered element in a sequence. This looks like a
>> > deliberate choice and I'd like to get a better understanding of the
>> > reason for ensuring this conformity. Does making the data
>> > structure's type deterministic allow for efficient pointer
>> > arithmetic when reading contiguous blocks, thus making reads
>> > performant?
>>
>> Like NumPy arrays, Arrow arrays are statically typed. In the case of
>> NumPy you simply have the limitation that the type system can only
>> represent a small number of types, all of which are primitive and
>> allow no nesting (e.g. you cannot have a NumPy array of NumPy arrays
>> of varying lengths). NumPy lets you work around this limitation with
>> the object dtype, which simply means you have an array of (64-bit)
>> pointers to Python objects about which NumPy knows nothing. In the
>> most simplistic form, you could achieve the same behaviour by
>> allocating an INT64 Arrow array, increasing the reference count of
>> each object, and then storing the objects' pointers in that array.
>> While this may work, please don't use this kind of hack.
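>>
>> As a small sketch of the difference (the exact error class pyarrow
>> raises for mixed input may differ between versions):
>>
>>   import numpy as np
>>   import pyarrow as pa
>>
>>   # NumPy: mixed types fall back to dtype=object, i.e. an array of
>>   # pointers to arbitrary Python objects.
>>   mixed = np.array([1.5, "a string"], dtype=object)
>>
>>   # Arrow: every array has a single static type. A homogeneous
>>   # sequence converts cleanly ...
>>   floats = pa.array([1.5, 2.5])  # inferred as float64
>>
>>   # ... while a mixed sequence is rejected rather than silently
>>   # boxed as Python objects.
>>   try:
>>       pa.array([1.5, "a string"])
>>   except (pa.ArrowInvalid, pa.ArrowTypeError):
>>       pass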
>>
>> The main concept of Arrow is to define data structures that can be
>> exchanged between applications implemented in different languages and
>> ecosystems. Storing Python objects in them runs against this use case
>> (we might support it one day for convenience in Python, but it will
>> be discouraged). In Arrow we have the concept of a UNION type, i.e.
>> we can specify that a row may contain an object from a fixed set of
>> types. This gives you nearly the same abilities as the object dtype,
>> with the improvement that you can also pass the data to an Arrow
>> consumer in any other language and it can cope with it. But this
>> comes a bit at the cost of usability: you need to specify up front
>> the types that occur in the array (this is also an "at least for
>> now"; we may write some auto-detection in the future, but that is a
>> bit of work).
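>>
>> To sketch what a UNION looks like (building unions from Python is
>> still limited; with a recent pyarrow a dense union can be assembled
>> by hand roughly like this, though exact names and signatures may vary
>> between versions):
>>
>>   import pyarrow as pa
>>
>>   # Logical column: [1.5, "a", 2.5] (a float, a string, a float).
>>   # Each child array holds the values of one member type.
>>   type_ids = pa.array([0, 1, 0], type=pa.int8())   # child per row
>>   offsets = pa.array([0, 0, 1], type=pa.int32())   # index in child
>>   children = [pa.array([1.5, 2.5]), pa.array(["a"])]
>>
>>   union = pa.UnionArray.from_dense(type_ids, offsets, children)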
>>
>> > 2. Pandas and numpy can also handle dictionary elements using
>> > dtype=object, while pyarrow arrays don't. I'd like to understand
>> > the reasoning behind the choice here as well.
>>
>> This is again due to Arrow being more statically typed than just
>> supporting pointers to generic objects. For this we currently have a
>> STRUCT type in Arrow, where each row has a set of named entries and
>> each entry has a fixed type (but the types can differ between
>> entries). Alternatively, we also have a MAP<KEY, VALUE> type (which
>> probably needs some more specification work). There you store data as
>> you do in a typical Python dictionary, but KEY and VALUE are fixed
>> types. Depending on your data, either STRUCT or MAP might be the
>> correct type to use.
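>>
>> As a sketch of the STRUCT case in pyarrow (the field names here are
>> made up, and the dict-to-struct conversion path is still being
>> improved, cf. ARROW-1705 at the top of this thread):
>>
>>   import pyarrow as pa
>>
>>   # Each row is a set of named entries, each with a fixed type.
>>   struct_type = pa.struct([('name', pa.string()),
>>                            ('age', pa.int64())])
>>   people = pa.array([{'name': 'Alice', 'age': 30},
>>                      {'name': 'Bob', 'age': 42}],
>>                     type=struct_type)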
>>
>> As we generally talk about columnar data in the Arrow context, we
>> expect the data in a column to be of the same or a similar type in
>> every row.
>>
>> Uwe
>>
