Can I perhaps assist? If I can get a few more specifics on what needs to be done, I think I can help. I'm OK with Cython, looking at some C++ code, etc.
Sent with ProtonMail Secure Email.

-------- Original Message --------
On February 1, 2018 3:31 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> I opened https://issues.apache.org/jira/browse/ARROW-2068, which may
> help. This is an accessible issue for someone in the community to work
> on; I'm not sure when I'll be able to get to it.
>
> Thanks
> Wes
>
> On Thu, Feb 1, 2018 at 8:27 AM, Eli h5r...@protonmail.ch wrote:
>> Hey Wes,
>> I understand there's another pointer, a definition level pointer, which is
>> basically a null location marker column. Exposing it as well to pick out the
>> nulls would be awesome.
>> The types of interest (to me) are varchars/strings, bools and numbers, just
>> basic primitive types that also exist in standard SQL, so having these two
>> columns available via Python would be sweet.
>> Thanks,
>> Eli
>> -------- Original Message --------
>> On January 31, 2018 4:06 PM, Wes McKinney wrote:
>>> hi Eli,
>>> This isn't available at the moment, but one could make the internal
>>> buffers in an array accessible in Python. How would you handle nulls
>>> in this scenario (the bytes for a null value in a primitive array can
>>> be any value)? How would one handle things other than numbers?
>>> - Wes
>>> On Wed, Jan 31, 2018 at 5:14 AM, Eli h5r...@protonmail.ch wrote:
>>>> Hey Wes,
>>>> What I meant by "standard" is the binary representation of a specific type
>>>> aggregated together.
>>>> The int32 column [1,2,3] would make
>>>> '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00' for example.
>>>> This is already available via Python's struct.pack(),
>>>> array.array().tostring() or np.array().astype().tobytes()
>>>> What I was wondering is whether that specific representation is already
>>>> there in Arrow's C++ mechanics somewhere, and whether one can get hold of
>>>> it from Pyarrow.
>>>> I don't know C++ very well, but I think what I'm looking for is in
>>>> buffer.h, there are pointers to types under Buffer which I think point to
>>>> just that.
>>>> I saw that Buffer is actually accessible via pa.lib.Buffer, and that it
>>>> even has a to_pybytes() method.
>>>> However:
>>>> - I'm not sure those are the bytes that I speak of
>>>> - I'm not sure how to use Buffer to find out, keep getting core dumps when
>>>>   trying
>>>> -------- Original Message --------
>>>> On January 10, 2018 7:34 PM, Wes McKinney wrote:
>>>>
>>>>> hi Eli,
>>>>> I am not aware of any standards for binary columns (or at least, I
>>>>> don't know what "regular" means in this context) -- part of the
>>>>> purpose of the Apache Arrow project is to define a columnar standard
>>>>> in the absence of any existing one. Most database systems define their
>>>>> own custom wire protocols.
>>>>> Do you have a link to the specification for the binary protocol for
>>>>> the database you are using (or some other documentation)?
>>>>> Thanks,
>>>>> Wes
>>>>> On Wed, Jan 10, 2018 at 12:47 AM, Eli h5r...@protonmail.ch wrote:
>>>>>> Hey Wes,
>>>>>> The database in question accepts columnar chunks of "regular" binary
>>>>>> data over the network, one of the sources of which is parquet.
>>>>>> Thus, data only comes out of parquet on my side, and I was wondering how
>>>>>> to get it out as "regular" binary columns. Something like tobytes() for
>>>>>> an Arrow Column, or maybe read_asbytes() for pa itself. The purpose is
>>>>>> to get to standard binary columns as fast as possible.
>>>>>> Thanks,
>>>>>> Eli
>>>>>>> -------- Original Message --------
>>>>>>> Subject: Re: How to get "standard" binary columns out of a pyarrow table
>>>>>>> Local Time: January 10, 2018 5:32 AM
>>>>>>> UTC Time: January 10, 2018 3:32 AM
>>>>>>> From: wesmck...@gmail.com
>>>>>>> To: dev@arrow.apache.org, Eli h5r...@protonmail.ch
>>>>>>>
>>>>>>> hi Eli,
>>>>>>> I'm wondering what kind of API you would want, if the perfect one
>>>>>>> existed. If I understand correctly, you are embedding objects in a
>>>>>>> BYTE_ARRAY column in Parquet, and need to do some post-processing as
>>>>>>> the data goes in / comes out of Parquet?
>>>>>>> Thanks,
>>>>>>> Wes
>>>>>>> On Sat, Jan 6, 2018 at 8:37 AM, Eli h5r...@protonmail.ch wrote:
>>>>>>>> Hi,
>>>>>>>> I'm looking to send "regular" columnar binary data to a database, the
>>>>>>>> kind that gets created by struct.pack, array.array, numpy.tobytes or
>>>>>>>> str.encode.
>>>>>>>> The origin is parquet files, which I'm reading ever so comfortably via
>>>>>>>> PyArrow.
>>>>>>>> I do however need to deserialize to Python objects, currently via
>>>>>>>> to_pandas(), then re-serialize the columns with one of the above.
>>>>>>>> I was wondering whether there was a better way to go about it, one
>>>>>>>> which would be the fastest and most effective.
>>>>>>>> Ideally I'd like to go through Python, but I can do C or even some C++
>>>>>>>> if necessary.
>>>>>>>> I posted the question on stackoverflow, and was asked to post here.
>>>>>>>> Appreciate any feedback!
>>>>>>>> Thanks,
>>>>>>>> Eli
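
For reference, here is a rough sketch of the "standard" binary column discussed in the thread, using only the Python standard library. It packs the int32 column [1,2,3] into the little-endian bytes quoted above and builds a separate null marker column alongside it. The helper name pack_int32_column, the one-byte-per-value null layout, and writing 0 into null slots are all assumptions made for illustration; the database's actual wire format is not given in the thread, and this is not how Arrow or Parquet store nulls internally.

```python
import struct


def pack_int32_column(values):
    """Pack ints (None = null) into little-endian int32 bytes.

    Returns (data, nulls): `data` holds 4 bytes per value, with an
    arbitrary 0 written for nulls; `nulls` holds one byte per value,
    1 for null and 0 for present. Both layouts are assumptions.
    """
    fmt = "<%di" % len(values)  # e.g. "<3i" for three little-endian int32s
    data = struct.pack(fmt, *(0 if v is None else v for v in values))
    nulls = bytes(1 if v is None else 0 for v in values)
    return data, nulls


data, nulls = pack_int32_column([1, 2, 3])
print(data)   # raw bytes of the value column
print(nulls)  # raw bytes of the null marker column
```

Keeping the null markers in their own column means the value bytes stay fixed-width, at the cost of a placeholder value in every null slot, which is why the receiver has to read both columns together.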