Re: Pickle data from python

2018-04-12 Thread Wes McKinney
hi Alberto,

If you cannot find a JIRA about pickling RecordBatch objects, could
you please create one? A patch would be welcome for this; it is
certainly in scope for the project.
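
In the meantime, a possible workaround is to ship the RecordBatch through
the Arrow IPC stream format and pickle the resulting bytes. A rough sketch
(using current pyarrow names, which may differ slightly from 0.9):

import pickle
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ['x'])

# Write the batch into the IPC stream format held in an in-memory buffer.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
payload = pickle.dumps(sink.getvalue().to_pybytes())

# Later / elsewhere: unpickle the bytes and read the batch back.
data = pickle.loads(payload)
reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
restored = reader.read_next_batch()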

If you encounter any new problems, please open a bug report.

Thanks!
Wes

On Thu, Apr 12, 2018 at 3:13 PM, ALBERTO Bocchinfuso
 wrote:
> Hello,
>
> I cannot pickle RecordBatches, Buffers etc.
>
> I found Issue 1654 in the issue tracker, which was resolved by pull
> request 1238. But that change appears to apply only to the types listed
> there (schemas, DataTypes, etc.).
> When I try to pickle Buffers etc., I get exactly the same error reported
> in that issue.
> Is support for pickling all pyarrow data types (RecordBatches in
> particular) on the roadmap?
>
> Thank you,
> Alberto


Re: Continuous benchmarking setup

2018-04-12 Thread Tom Augspurger
https://github.com/TomAugspurger/asv-runner/ is the setup for the projects
currently being benchmarked. Adding Arrow to
https://github.com/TomAugspurger/asv-runner/blob/master/tests/full.yml might
work; I'll have to redeploy with the update.


From: Wes McKinney 
Sent: Thursday, April 12, 2018 7:24:20 PM
To: dev@arrow.apache.org
Subject: Re: Continuous benchmarking setup

hi Antoine,

I have a bare metal machine at home (affectionately known as the
"pandabox") that's available via SSH and that we've been using for
continuous benchmarking of other projects. Arrow is welcome to use
it. I can give you access to the machine if you would like. Hopefully,
we can suitably automate the process of setting up a continuous
benchmarking machine so that, if we need to migrate to a new machine,
it is not too much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou  wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least for commits that do not predate those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host.  Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: access to NVidia hardware might be a nice-to-have in the future,
> but right now there are no CUDA benchmarks in the Python benchmark suite)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.


Re: Buffer slices are unsafe

2018-04-12 Thread Wes McKinney
My feeling is that we should advise users of the library that any
slices of a ResizableBuffer become invalid after a call to Resize.
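
To illustrate the hazard through the Python bindings (a sketch using
current pyarrow names; the same applies at the C++ level):

import pyarrow as pa

buf = pa.allocate_buffer(64, resizable=True)  # a ResizableBuffer
view = buf.slice(0, 16)       # the slice points into buf's current memory
buf.resize(1 << 20)           # may reallocate the underlying memory
# 'view' may now point at freed memory; per the advice above, do not
# use a slice after the parent buffer has been resized.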

> I was thinking about something like this [0]. The point is that the slice
> user has no way of knowing whether the slice can still be safely used and
> who owns the memory.

You can look at the Buffer parent to see if there is a parent-child
relationship, which at least tells you whether you definitely do _not_
own the memory.
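
For example (a small sketch via the Python bindings, assuming current
pyarrow attribute names):

import pyarrow as pa

buf = pa.py_buffer(b"0123456789")
child = buf.slice(2, 4)
print(buf.parent)                 # None: top-level buffer
print(child.parent is not None)   # True: the slice does not own its memory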

I'm not convinced from this use case that we need to change the way
that the Buffer abstraction works. If there is a need for memory
ownership-nannying, that may be best handled by some other kind of
abstract interface that uses Buffers for its implementation.

- Wes

On Wed, Apr 11, 2018 at 8:05 AM, Antoine Pitrou  wrote:
>
> Hi Dimitri,
>
> On 11/04/2018 at 13:42, Dimitri Vorona wrote:
>>
>> I was thinking about something like this [0]. The point is that the slice
>> user has no way of knowing whether the slice can still be safely used and
>> who owns the memory.
>
> I think the answer is that calling free() on something you exported to
> consumers is incorrect.  If you allocate buffers, you should choose a
> Buffer implementation with proper ownership semantics.  For example, we
> have PoolBuffer, but also Python buffers and CUDA buffers.  They all
> (should) have proper ownership.  If you want to create buffers with data
> managed with malloc/free, you need to write a MallocBuffer implementation.
>
>> A step back is a good idea. My use case would be to return a partially
>> built slice on a buffer, while continuing appending to the buffer. Think
>> delta dictionaries: while a slice of the coding table can be sent, we will
>> have additional data to append later on.
>
> I don't know anything about delta dictionaries, but I get the idea.
>
> Does the implementation become harder if you split the coding table into
> several buffers that never get resized?
>
>> To build on your previous proposal: maybe a more finely grained locking
>> mechanism, like data_ being a shared_ptr, with slices grabbing a
>> copy of it when they want to use it and releasing it afterwards? The parent
>> would then check the counter of the shared_ptr (similar to the number of
>> slices).
>
> You need an actual lock to avoid race conditions (the parent may find a
> zero shared_ptr counter, but another thread would grab a data pointer
> immediately after).
>
> I wonder if we really want such implementation complexity.  Also,
> everyone is now paying the price of locking.  Ideally slicing and
> fetching a data pointer should be cheap.  I'd like to know what others
> think about this.
>
> Regards
>
> Antoine.


Re: Continuous benchmarking setup

2018-04-12 Thread Wes McKinney
hi Antoine,

I have a bare metal machine at home (affectionately known as the
"pandabox") that's available via SSH and that we've been using for
continuous benchmarking of other projects. Arrow is welcome to use
it. I can give you access to the machine if you would like. Hopefully,
we can suitably automate the process of setting up a continuous
benchmarking machine so that, if we need to migrate to a new machine,
it is not too much of a hardship to do so.

Thanks
Wes

On Wed, Apr 11, 2018 at 9:40 AM, Antoine Pitrou  wrote:
>
> Hello
>
> With the following changes, it seems we might reach the point where
> we're able to run the Python-based benchmark suite across multiple
> commits (at least for commits that do not predate those changes):
> https://github.com/apache/arrow/pull/1775
>
> To make this truly useful, we would need a dedicated host.  Ideally a
> (Linux) OS running on bare metal, with SMT/HyperThreading disabled.
> If running virtualized, the VM should have dedicated physical CPU cores.
>
> That machine would run the benchmarks on a regular basis (perhaps once
> per night) and publish the results in static HTML form somewhere.
>
> (note: access to NVidia hardware might be a nice-to-have in the future,
> but right now there are no CUDA benchmarks in the Python benchmark suite)
>
> What should be the procedure here?
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

2018-04-12 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2451:
---

 Summary: Handle more dtypes efficiently in custom numpy array 
serializer.
 Key: ARROW-2451
 URL: https://issues.apache.org/jira/browse/ARROW-2451
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


Right now certain dtypes like bool or fixed length strings are serialized as 
lists, which is inefficient. We can handle these more efficiently by casting 
them to uint8 and saving the original dtype as additional data.
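
A rough illustration of the idea (just a sketch in plain NumPy, not the
eventual patch):

{code}
import numpy as np

def encode(arr):
    # Reinterpret the array as raw uint8 bytes and keep the original dtype
    # and shape as metadata so they can be restored on deserialization.
    return arr.view(np.uint8), str(arr.dtype), arr.shape

def decode(data, dtype, shape):
    return data.view(dtype).reshape(shape)

arr = np.array([True, False, True])  # bool is currently serialized as a list
data, dtype, shape = encode(arr)
restored = decode(data, dtype, shape)
assert restored.dtype == np.bool_ and (restored == arr).all()
{code}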



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Pickle data from python

2018-04-12 Thread ALBERTO Bocchinfuso
Hello,

I cannot pickle RecordBatches, Buffers etc.

I found Issue 1654 in the issue tracker, which was resolved by pull request
1238. But that change appears to apply only to the types listed there
(schemas, DataTypes, etc.).
When I try to pickle Buffers etc., I get exactly the same error reported in
that issue.
Is support for pickling all pyarrow data types (RecordBatches in particular)
on the roadmap?

Thank you,
Alberto


RE: Correct way to set NULL values in VarCharVector (Java API)?

2018-04-12 Thread Atul Dambalkar
Hi Sid, Emilio,

It was a mistake on my part. I was not setting the holder.start and holder.end 
values inside the NullableVarCharHolder, which was causing the issue. It works 
now.

Regards,
-Atul

-Original Message-
From: Atul Dambalkar 
Sent: Wednesday, April 11, 2018 5:18 PM
To: dev@arrow.apache.org
Subject: RE: Correct way to set NULL values in VarCharVector (Java API)?

Hi Sid, Emilio,

Need some more help. Here is how I am using the NullableVarCharHolder -

--
String value = "some text string";
NullableVarCharHolder holder = new NullableVarCharHolder();
holder.isSet = 1;
byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
holder.buffer = varcharVector.getAllocator().buffer(bytes.length);
holder.buffer.setBytes(0, bytes, 0, bytes.length);
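// Note (per the follow-up at the top of this thread): holder.start and
// holder.end are never set here, which is why get(index) returned null.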
varcharVector.setIndexDefined(index);
varcharVector.setSafe(index, holder);
varcharVector.setValueCount(index + 1);
-

When I try to access the byte[] from the VarCharVector as
varcharVector.get(index), it returns a null array. If I read the
holder.buffer value before putting it in the VarCharVector, I can see the
correct byte[], but after I set it inside the vector, it comes back null.
Is this the correct usage of the API?

-Atul



-Original Message-
From: Siddharth Teotia [mailto:siddha...@dremio.com]
Sent: Wednesday, April 11, 2018 10:27 AM
To: dev@arrow.apache.org
Subject: Re: Correct way to set NULL values in VarCharVector (Java API)?

Another option is to use the set() API, which lets you indicate whether the
value is NULL via an isSet parameter (0 for NULL, 1 otherwise). This is
similar to the holder-based APIs, where you indicate in holder.isSet whether
the value is NULL.

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L1095

Thanks,
Siddharth

On Wed, Apr 11, 2018 at 6:14 AM, Emilio Lahr-Vivaz 
wrote:

> Hi Atul,
>
> You should be able to use the overloaded 'set' method that takes a
> NullableVarCharHolder:
>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java#L237
>
> Thanks,
>
> Emilio
>
>
> On 04/10/2018 05:23 PM, Atul Dambalkar wrote:
>
>> Hi,
>>
>> I wanted to know the best way to handle NULL string values coming
>> from a relational database. I am trying to set string values via the
>> Java API's VarCharVector. Like a few other Arrow vectors
>> (TimeStampVector, TimeMilliVector), the VarCharVector doesn't have a
>> way to set a NULL value as one of the elements. Can someone advise
>> on the correct mechanism to store NULL values in this case?
>>
>> Regards,
>> -Atul
>>
>>
>>
>


[jira] [Created] (ARROW-2450) [Python] Saving to parquet fails for empty lists

2018-04-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2450:
--

 Summary: [Python] Saving to parquet fails for empty lists
 Key: ARROW-2450
 URL: https://issues.apache.org/jira/browse/ARROW-2450
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
 Fix For: 0.9.1


When writing a table to parquet through pandas, if any column includes an empty 
list, it fails with a segmentation fault.

Minimal example:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

def save(rows):
    table1 = pa.Table.from_pandas(pd.DataFrame(rows))
    pq.write_table(table1, 'test-foo.pq')
    table2 = pq.read_table('test-foo.pq')

    print('ROWS:', rows)
    print('TABLE1:', table1.to_pandas(), sep='\n')
    print('TABLE2:', table2.to_pandas(), sep='\n')

save([{'val': ['something']}])
print('---')
save([{'val': []}])  # empty
{code}

Output:

{code}
ROWS: [{'val': ['something']}]
TABLE1:
   val
0  [something]
TABLE2:
   val
0  [something]
---
ROWS: [{'val': []}]
TABLE1:
  val
0  []
[1]13472 segmentation fault (core dumped)  python3 test.py
{code}

Versions:

{code}
$ pip3 list | grep pyarrow
pyarrow (0.9.0)
$ python3 --version
Python 3.5.2
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


new user question about cross-language use

2018-04-12 Thread Chris Withers

Hi All,

Apologies if I'm on the wrong list or struggle to get my question
across; I'm very new to Arrow, so please point me to the best place if
there's somewhere better to ask these kinds of questions...


So, in my mind, Arrow provides a single in-memory model that supports
access from a bunch of different languages/environments (Pandas, Go,
C++, etc., from looking at https://github.com/apache/arrow). That gives
me hope: I'm just starting out on a project to go from a proprietary
C++ trading framework's market data archive to Pandas dataframes, and
if the data goes through Arrow in the middle, it could potentially also
give other environments (Go, Julia?) a way to make use of the same thing.


That left me wondering, however: if I write a "to Arrow" converter in
C++, how would a Go or Python user then wire things up to get access to
the Arrow data structures?
A somewhat important bonus point: how would that happen without memory
copies? (The datasets here are many GB in most cases.)


cheers,

Chris


[jira] [Created] (ARROW-2449) [Python] Efficiently serialize functions containing NumPy arrays

2018-04-12 Thread Richard Shin (JIRA)
Richard Shin created ARROW-2449:
---

 Summary: [Python] Efficiently serialize functions containing NumPy 
arrays 
 Key: ARROW-2449
 URL: https://issues.apache.org/jira/browse/ARROW-2449
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Richard Shin


It is my understanding that pyarrow falls back to serializing functions (and
other complex Python objects) using cloudpickle, which means that the contents
of those functions are also serialized using the fallback method, rather than
the efficient method described in
https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
 It would be good to get the benefit of fast zero-copy (de)serialization for 
objects like NumPy arrays contained inside functions.

{code}
In [1]: import numpy as np, pyarrow as pa

In [2]: pa.__version__
Out[2]: '0.9.0'

In [3]: arr = np.random.rand(1)

In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
The slowest run took 38.29 times longer than the fastest. This could mean that 
an intermediate result is being cached.
1 loops, best of 3: 68.7 µs per loop

In [5]: def arr_f(): return arr

In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
The slowest run took 5.89 times longer than the fastest. This could mean that 
an intermediate result is being cached.
1000 loops, best of 3: 539 µs per loop
{code}

For comparison:

{code}
In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
1000 loops, best of 3: 193 µs per loop

In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
The slowest run took 4.02 times longer than the fastest. This could mean that 
an intermediate result is being cached.
1000 loops, best of 3: 429 µs per loop
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)