Re: How to get "standard" binary columns out of a pyarrow table

Eli Wed, 31 Jan 2018 02:14:36 -0800

Hey Wes,


By "standard" I mean the binary representations of the values of a specific 
type, concatenated together.

The int32 column [1,2,3] would make 
'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.

This is already available via Python's struct.pack(), array.array().tobytes(), 
or np.array().astype().tobytes().
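
A minimal sketch of the representation described above, using only the standard library. The variable names are illustrative, and the array.array result matches the struct.pack result only on a little-endian machine where a C int is 4 bytes:

```python
import array
import struct

values = [1, 2, 3]

# Explicit little-endian, three 4-byte signed integers.
packed = struct.pack("<3i", *values)

# array.array uses the machine's native byte order and C int size.
from_array = array.array("i", values).tobytes()

print(packed)
print(from_array)
```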

What I was wondering is whether that specific representation is already there 
in Arrow's C++ internals somewhere, and whether one can get hold of it from 
PyArrow.

I don't know C++ very well, but I think what I'm looking for is in buffer.h; 
there are pointers to types under Buffer which I think point to just that.

I saw that Buffer is actually accessible via pa.lib.Buffer, and that it even 
has a to_pybytes() method.

However:

- I'm not sure those are the bytes I'm talking about

- I'm not sure how to use Buffer to find out; I keep getting core dumps when 
I try



Sent with ProtonMail Secure Email.


-------- Original Message --------
 On January 10, 2018 7:34 PM, Wes McKinney wrote:

>hi Eli,
>
> I am not aware of any standards for binary columns (or at least, I
> don't know what "regular" means in this context) -- part of the
> purpose of the Apache Arrow project is to define a columnar standard
> in the absence of any existing one. Most database systems define their
> own custom wire protocols.
>
> Do you have a link to the specification for the binary protocol for
> the database you are using (or some other documentation)?
>
> Thanks,
> Wes
>
> On Wed, Jan 10, 2018 at 12:47 AM, Eli [email protected] wrote:
>>Hey Wes,
>>The database in question accepts columnar chunks of "regular" binary data 
>>over the network, one of the sources of which is parquet.
>>Thus, data only comes out of parquet on my side, and I was wondering how to 
>>get it out as "regular" binary columns. Something like tobytes() for an Arrow 
>>Column, or maybe read_asbytes() for pa itself. The purpose is to get to 
>>standard binary columns as fast as possible.
>>Thanks,
>> Eli
>>>-------- Original Message --------
>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>> Local Time: January 10, 2018 5:32 AM
>>> UTC Time: January 10, 2018 3:32 AM
>>> From: [email protected]
>>> To: [email protected], Eli [email protected]
>>>hi Eli,
>>>I'm wondering what kind of API you would want, if the perfect one
>>> existed. If I understand correctly, you are embedding objects in a
>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>> the data goes in / comes out of Parquet?
>>>Thanks,
>>> Wes
>>>On Sat, Jan 6, 2018 at 8:37 AM, Eli [email protected] wrote:
>>>>Hi,
>>>> I'm looking to send "regular" columnar binary data to a database, the kind 
>>>> that gets created by struct.pack, array.array, numpy.tobytes or str.encode.
>>>> The origin is parquet files, which I'm reading ever so comfortably via 
>>>> PyArrow.
>>>> I do however need to deserialize to Python objects, currently via 
>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>> I was wondering whether there was a better way to go about it, one which 
>>>> would be as fast and efficient as possible.
>>>> Ideally I'd like to go through Python, but I can do C or even some C++ if 
>>>> necessary.
>>>> I posted the question on stackoverflow, and was asked to post here. 
>>>> Appreciate any feedback!
>>>> Thanks,
>>>> Eli
>>>>
>>>
>>
>
