Re: Pickle data from python
hi Alberto,

If you cannot find a JIRA about pickling RecordBatch objects, could you please create one? A patch would be welcome for this; it is certainly in scope for the project. If you encounter any new problems, please open a bug report.

Thanks!
Wes

On Thu, Apr 12, 2018 at 3:13 PM, ALBERTO Bocchinfuso wrote:
> Hello,
>
> I cannot pickle RecordBatches, Buffers, etc.
>
> I found issue 1654 in the issue tracker, which has been resolved by pull
> request 1238. But that looks to apply only to the types listed (schemas,
> DataTypes, etc.). When I try to pickle Buffers etc. I get exactly the
> same error reported in the issue.
>
> Is support for pickling all of the pyarrow data types (with particular
> attention to RecordBatches etc.) on the agenda?
>
> Thank you,
> Alberto
Re: Continuous benchmarking setup
https://github.com/TomAugspurger/asv-runner/ is the setup for the projects currently running. Adding arrow to https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might work. I'll have to redeploy with the update.

From: Wes McKinney
Sent: Thursday, April 12, 2018 7:24:20 PM
To: dev@arrow.apache.org
Subject: Re: Continuous benchmarking setup

hi Antoine,

I have a bare metal machine at home (affectionately known as the "pandabox") that's available via SSH that we've been using for continuous benchmarking for other projects. Arrow is welcome to use it. I can give you access to the machine if you would like.

Hopefully, we can suitably document the process of setting up a continuous benchmarking machine so that if we need to migrate to a new machine, it is not too much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least the ones not anterior to those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host. Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: nice to have in the future might be access to NVidia hardware,
> but right now there are no CUDA benchmarks in the Python benchmarks)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.
Re: Buffer slices are unsafe
My feeling is that we should advise users of the library that any slices of a ResizableBuffer become invalid after a call to Resize.

> I was thinking about something like this [0]. The point is that the slice
> user has no way of knowing if the slice can still be safely used and who
> owns the memory.

You can look at the Buffer parent to see if there is a parent-child relationship, which at least tells you whether you definitely do _not_ own the memory. I'm not convinced from this use case that we need to change the way that the Buffer abstraction works. If there is a need for memory ownership-nannying, that may be best handled by some other kind of abstract interface that uses Buffers for its implementation.

- Wes

On Wed, Apr 11, 2018 at 8:05 AM, Antoine Pitrou wrote:
>
> Hi Dimitri,
>
> On 11/04/2018 at 13:42, Dimitri Vorona wrote:
>>
>> I was thinking about something like this [0]. The point is that the slice
>> user has no way of knowing if the slice can still be safely used and who
>> owns the memory.
>
> I think the answer is that calling free() on something you exported to
> consumers is incorrect. If you allocate buffers, you should choose a
> Buffer implementation with proper ownership semantics. For example, we
> have PoolBuffer, but also Python buffers and CUDA buffers. They all
> (should) have proper ownership. If you want to create buffers with data
> managed with malloc/free, you need to write a MallocBuffer implementation.
>
>> A step back is a good idea. My use case would be to return a partially
>> built slice of a buffer while continuing to append to the buffer. Think
>> delta dictionaries: while a slice of the coding table can be sent, we
>> will have additional data to append later on.
>
> I don't know anything about delta dictionaries, but I get the idea.
>
> Does the implementation become harder if you split the coding table into
> several buffers that never get resized?
>
>> To build on your previous proposal: maybe some more finely grained
>> locking mechanism, like the data_ being a shared_ptr, slices grabbing a
>> copy of it when they want to use it and releasing it afterwards? The
>> parent would then check the counter of the shared_ptr (similar to the
>> number of slices).
>
> You need an actual lock to avoid race conditions (the parent may find a
> zero shared_ptr counter, but another thread would grab a data pointer
> immediately after).
>
> I wonder if we really want such implementation complexity. Also,
> everyone is now paying the price of locking. Ideally slicing and
> fetching a data pointer should be cheap. I'd like to know what others
> think about this.
>
> Regards
>
> Antoine.
Re: Continuous benchmarking setup
hi Antoine,

I have a bare metal machine at home (affectionately known as the "pandabox") that's available via SSH that we've been using for continuous benchmarking for other projects. Arrow is welcome to use it. I can give you access to the machine if you would like.

Hopefully, we can suitably document the process of setting up a continuous benchmarking machine so that if we need to migrate to a new machine, it is not too much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least the ones not anterior to those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host. Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: nice to have in the future might be access to NVidia hardware,
> but right now there are no CUDA benchmarks in the Python benchmarks)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.
Robert Nishihara created ARROW-2451:
---------------------------------------

             Summary: Handle more dtypes efficiently in custom numpy array serializer.
                 Key: ARROW-2451
                 URL: https://issues.apache.org/jira/browse/ARROW-2451
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Robert Nishihara

Right now certain dtypes like bool or fixed-length strings are serialized as lists, which is inefficient. We can handle these more efficiently by casting them to uint8 and saving the original dtype as additional data.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
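(The approach the issue describes can be sketched in plain NumPy. This is an illustrative assumption about the encoding — the helper names and the exact scheme are hypothetical, not the patch that eventually landed.)

```python
import numpy as np

def encode(arr):
    # Reinterpret the array's bytes as uint8 (zero-copy view) and keep
    # the original dtype string so the receiver can reverse the view.
    return arr.view(np.uint8), arr.dtype.str

def decode(raw, dtype_str):
    # Viewing the uint8 bytes back as the original dtype is also zero-copy.
    return raw.view(np.dtype(dtype_str))

original = np.array([True, False, True])
restored = decode(*encode(original))
assert restored.dtype == original.dtype
assert (restored == original).all()
```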
Pickle data from python
Hello,

I cannot pickle RecordBatches, Buffers, etc.

I found issue 1654 in the issue tracker, which has been resolved by pull request 1238. But that looks to apply only to the types listed (schemas, DataTypes, etc.). When I try to pickle Buffers etc. I get exactly the same error reported in the issue.

Is support for pickling all of the pyarrow data types (with particular attention to RecordBatches etc.) on the agenda?

Thank you,
Alberto
RE: Correct way to set NULL values in VarCharVector (Java API)?
Hi Sid, Emilio,

It was a mistake on my part. I was not setting the holder.start and holder.end values inside the NullableVarCharHolder, which was causing the issue. It works now.

Regards,
-Atul

-----Original Message-----
From: Atul Dambalkar
Sent: Wednesday, April 11, 2018 5:18 PM
To: dev@arrow.apache.org
Subject: RE: Correct way to set NULL values in VarCharVector (Java API)?

Hi Sid, Emilio,

Need some more help. Here is how I am using the NullableVarCharHolder:

String value = "some text string";
NullableVarCharHolder holder = new NullableVarCharHolder();
holder.isSet = 1;
byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
holder.buffer = varcharVector.getAllocator().buffer(bytes.length);
holder.buffer.setBytes(0, bytes, 0, bytes.length);
varcharVector.setIndexDefined(index);
varcharVector.setSafe(index, holder);
varcharVector.setValueCount(index + 1);

When I try to access the byte[] from the VarCharVector as varcharVector.get(index), it returns a null array. If I access the holder.buffer value before putting it in the VarCharVector, I can access the correct byte[], but after I set it inside the vector, I get null. Is this correct usage of the API?

-Atul

-----Original Message-----
From: Siddharth Teotia [mailto:siddha...@dremio.com]
Sent: Wednesday, April 11, 2018 10:27 AM
To: dev@arrow.apache.org
Subject: Re: Correct way to set NULL values in VarCharVector (Java API)?

Another option is to use the set() API that allows you to indicate whether the value is NULL or not using an isSet parameter (0 for NULL, 1 otherwise). This is similar to the holder-based APIs, where you need to indicate in holder.isSet whether the value is NULL or not.

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L1095

Thanks,
Siddharth

On Wed, Apr 11, 2018 at 6:14 AM, Emilio Lahr-Vivaz wrote:

> Hi Atul,
>
> You should be able to use the overloaded 'set' method that takes a
> NullableVarCharHolder:
>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java#L237
>
> Thanks,
>
> Emilio
>
> On 04/10/2018 05:23 PM, Atul Dambalkar wrote:
>
>> Hi,
>>
>> I wanted to know what's the best way to handle NULL string values
>> coming from a relational database. I am trying to set the string
>> values in the Java API - VarCharVector. Like a few other Arrow vectors
>> (TimeStampVector, TimeMilliVector), the VarCharVector doesn't have a
>> way to set a NULL value as one of the elements. Can someone advise
>> what's the correct mechanism to store NULL values in this case?
>>
>> Regards,
>> -Atul
[jira] [Created] (ARROW-2450) [Python] Saving to parquet fails for empty lists
Uwe L. Korn created ARROW-2450:
-------------------------------

             Summary: [Python] Saving to parquet fails for empty lists
                 Key: ARROW-2450
                 URL: https://issues.apache.org/jira/browse/ARROW-2450
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Uwe L. Korn
             Fix For: 0.9.1

When writing a table to parquet through pandas, if any column includes an empty list, it fails with a segmentation fault. Minimal example:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

def save(rows):
    table1 = pa.Table.from_pandas(pd.DataFrame(rows))
    pq.write_table(table1, 'test-foo.pq')
    table2 = pq.read_table('test-foo.pq')
    print('ROWS:', rows)
    print('TABLE1:', table1.to_pandas(), sep='\n')
    print('TABLE2:', table2.to_pandas(), sep='\n')

save([{'val': ['something']}])
print('---')
save([{'val': []}])  # empty
{code}

Output:

{code}
ROWS: [{'val': ['something']}]
TABLE1:
           val
0  [something]
TABLE2:
           val
0  [something]
---
ROWS: [{'val': []}]
TABLE1:
  val
0  []
[1] 13472 segmentation fault (core dumped)  python3 test.py
{code}

Versions:

{code}
$ pip3 list | grep pyarrow
pyarrow (0.9.0)
$ python3 --version
Python 3.5.2
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
new user question about cross-language use
Hi All,

Apologies if I'm on the wrong list or struggle to get my question across. I'm very new to Arrow, so please point me to a better place if there's somewhere more appropriate for these kinds of questions.

In my mind, Arrow provides a single in-memory model that supports access from a bunch of different languages/environments (Pandas, Go, C++, etc., from looking at https://github.com/apache/arrow). That gives me hope for a project I'm just starting: going from a proprietary C++ trading framework's market data archive to Pandas dataframes, and, if things go through Arrow in the middle, potentially letting other environments (Go, Julia?) make use of the same data.

That left me wondering: if I write a "to Arrow" converter in C++, how would a Go or Python user then wire things up to get access to the Arrow data structures? Somewhat important bonus point: how would that happen without memory copies? (Datasets here are many GB in most cases.)

cheers,
Chris
[jira] [Created] (ARROW-2449) [Python] Efficiently serialize functions containing NumPy arrays
Richard Shin created ARROW-2449:
--------------------------------

             Summary: [Python] Efficiently serialize functions containing NumPy arrays
                 Key: ARROW-2449
                 URL: https://issues.apache.org/jira/browse/ARROW-2449
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Richard Shin

It is my understanding that pyarrow falls back to serializing functions (and other complex Python objects) using cloudpickle, which means that the contents of those functions are also serialized using the fallback method, rather than the efficient method described in [https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html]. It would be good to get the benefit of fast zero-copy (de)serialization for objects like NumPy arrays contained inside functions.

{code}
In [1]: import numpy as np, pyarrow as pa

In [2]: pa.__version__
Out[2]: '0.9.0'

In [3]: arr = np.random.rand(1)

In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
The slowest run took 38.29 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loops, best of 3: 68.7 µs per loop

In [5]: def arr_f(): return arr

In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
The slowest run took 5.89 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 539 µs per loop
{code}

For comparison:

{code}
In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
1000 loops, best of 3: 193 µs per loop

In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
The slowest run took 4.02 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 429 µs per loop
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)