Re: [RESULT] [VOTE] Release Apache Arrow JavaScript 0.3.0 - RC0

2018-02-21 Thread Wes McKinney
I just rebased master on the JS release tag, so PRs opened in the last
few days may need to be rebased

On Wed, Feb 21, 2018 at 1:12 PM, Wes McKinney  wrote:
> With 3 binding +1 votes, and 1 non-binding +1 vote, and no other
> votes, the vote passes.
>
> Thanks all! I will upload the artifacts to SVN and post to NPM this
> afternoon, and send an announcement to announce@
>
> - Wes
>
> On Wed, Feb 21, 2018 at 12:57 PM, Jacques Nadeau  wrote:
>> Looks good to me. Verified on OSX.
>>
>> +1 (binding)
>>
>> On Tue, Feb 20, 2018 at 1:24 PM, Brian Hulette 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Ran dev/release/js-verify-release-candidate.sh with Node v8.9.1 on Ubuntu
>>> 16.04, looks good
>>>
>>> Also verified the output of ./targets/es2015/cjs/bin/arrow2csv.js on a
>>> test file
>>>
>>>
>>> On 02/20/2018 03:50 PM, Uwe L. Korn wrote:
>>>
 +1 (binding)
   Ran dev/release/js-verify-release-candidate.sh with Node 9.5.0 on
 Ubuntu 16.04, looks good

 On Mon, Feb 19, 2018, at 9:56 PM, Wes McKinney wrote:

> +1 (binding)
>
> Ran dev/release/js-verify-release-candidate.sh with Node 9.2. Looks good
>
> On Mon, Feb 19, 2018 at 3:54 PM, Wes McKinney 
> wrote:
>
>> Hello all,
>>
>> I'd like to propose the 1st release candidate (rc0) of Apache
>> Arrow JavaScript version 0.3.0.  This will be the second JavaScript
>> release, made separately from the main project releases.
>>
>> The source release rc0 is hosted at [1].
>>
>> This release candidate is based on commit
>> 7d992de1de7dd276eb9aeda349376e79b62da11c
>>
>> Please download, verify checksums and signatures, run the unit tests,
>> and vote
>> on the release. The easiest way is to use the JavaScript-specific
>> release
>> verification script dev/release/js-verify-release-candidate.sh.
>>
>> The vote will be open for at least 24 hours and will close once
>> enough PMCs have approved the release.
>>
>> [ ] +1 Release this as Apache Arrow JavaScript 0.3.0
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.0 because...
>>
>> Thanks,
>> Wes
>>
>> How to validate a release signature:
>> https://httpd.apache.org/dev/verification.html
>>
>> [1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js
>> -0.3.0-rc0/
>> [2]: https://github.com/apache/arrow/tree/7d992de1de7dd276eb9aeda
>> 349376e79b62da11c
>>
>
>>>


[jira] [Created] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2018-02-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2196:
---

 Summary: [C++] Consider quarantining platform code with dependency 
on non-header Boost code
 Key: ARROW-2196
 URL: https://issues.apache.org/jira/browse/ARROW-2196
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


See the discussion in ARROW-2193 for the motivation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Python] Retrieving a RecordBatch from plasma inside a function

2018-02-21 Thread Philipp Moritz
I created one here: https://issues.apache.org/jira/browse/ARROW-2195

On Wed, Feb 21, 2018 at 8:11 AM, Wes McKinney  wrote:

> Can we create a JIRA to track this issue?
>
> On Wed, Feb 21, 2018 at 5:04 AM, ALBERTO Bocchinfuso
>  wrote:
> > Hi,
> >
> > Have you had any news on this issue?
> > Do you plan to solve it for the next releases of Arrow, or is there any
> way to avoid the problem?
> >
> > Thanks in advance,
> > Alberto
> > Da: Philipp Moritz
> > Inviato: venerdì 9 febbraio 2018 00:30
> > A: dev@arrow.apache.org
> > Oggetto: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
> >
> > Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
> > plan to look into it on the weekend.
> >
> > Here is the preliminary backtrace for everybody interested:
> >
> > EXC_BAD_ACCESS (code=1, address=0x38158)
> >
> > frame #0: 0x00010e6457fc
> > lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) +
> 28
> >
> > lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> >
> > ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
> >
> > 0x10e645800 <+32>: callq  0x10e698170   ; symbol stub
> for:
> > PyInt_FromLong
> >
> > 0x10e645805 <+37>: testq  %rax, %rax
> >
> > 0x10e645808 <+40>: je 0x10e64580c   ; <+44>
> >
> > (lldb) bt
> >
> > * thread #1: tid = 0xf1378e, 0x00010e6457fc
> > lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) +
> 28,
> > queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
> > address=0x38158)
> >
> >   * frame #0: 0x00010e6457fc
> > lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) +
> 28
> >
> > frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_
> CallNoArg(_object*)
> > + 133
> >
> > frame #2: 0x00010e613b25
> > lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
> >
> > frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
> >
> > frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx +
> > 22305
> >
> > On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
> > alberto_boc...@hotmail.it> wrote:
> >
> >> Hi,
> >>
> >> I’m using python 3.5.2 and pyarrow 0.8.0
> >>
> >> As key, I put a string of 20 bytes, of course. I’m doing it differently
> >> from the canonical way since I’m no more using python 2.7, but python 3,
> >> and this seemed to me to be the right way to create a string of 20
> bytes.
> >> The full code is:
> >>
> >> import pyarrow as pa
> >> import pyarrow.plasma as plasma
> >>
> >> def retrieve1():
> >>  client = plasma.connect('test', "", 0)
> >>
> >>  key = "keynumber1keynumber1"
> >>  pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >>
> >>  [buff] = client .get_buffers([pid])
> >>  batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >>
> >>  print(batch)
> >>  print(batch.schema)
> >>  print(batch[0])
> >>
> >>  return batch
> >>
> >> client = plasma.connect('test', "", 0)
> >>
> >> test1 = [1, 12, 23, 3, 21, 34]
> >> test1 = pa.array(test1, pa.int32())
> >>
> >> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >>
> >> key = "keynumber1keynumber1"
> >> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >> sink = pa.MockOutputStream()
> >> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> >> stream_writer.write_batch(batch)
> >> stream_writer.close()
> >>
> >> bff = client.create(pid, sink.size())
> >>
> >> stream = pa.FixedSizeBufferWriter(bff)
> >> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >> writer.write_batch(batch)
> >> client.seal(pid)
> >>
> >> batch = retrieve1()
> >> print(batch)
> >> print(batch.schema)
> >> print(batch[0])
> >>
> >> I hope this helps,
> >> thank you
> >>
> >> Da: Philipp Moritz
> >> Inviato: martedì 6 febbraio 2018 00:00
> >> A: dev@arrow.apache.org
> >> Oggetto: Re: [Python] Retrieving a RecordBatch from plasma inside a
> >> function
> >>
> >> Hey Alberto,
> >>
> >> Thanks for your message! I'm trying to reproduce it.
> >>
> >> Can you attach the code you use to write the batch into the store?
> >>
> >> Also can you say which version of Python and Arrow you are using? On my
> >> installation, I get
> >>
> >> ```
> >>
> >> In [*5*]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> >>
> >> 
> >> ---
> >>
> >> ValueErrorTraceback (most recent call
> last)
> >>
> >>  in ()
> >>
> >> > 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> >>
> >>
> >> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
> >>
> >>
> >> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> >> ```
> >>
> >> 
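[Editorial note] The 20-byte requirement behind the ValueError in the traceback above can be checked with plain Python, no plasma store needed; the `plasma.ObjectID` call itself is left commented out since it assumes `pyarrow.plasma` is installed:

```python
# A plasma ObjectID must be exactly 20 bytes. In Python 3, a str key has
# to be encoded to bytes first; ASCII characters encode to one byte each.
key = "keynumber1keynumber1"
key_bytes = key.encode("utf-8")               # same bytes as bytearray(key, 'UTF-8')
assert len(key_bytes) == 20                   # satisfies the 20-byte requirement
assert bytes(bytearray(key, "UTF-8")) == key_bytes
# oid = plasma.ObjectID(key_bytes)            # hypothetical call; needs pyarrow.plasma
```

The canonical form suggested later in the thread, `plasma.ObjectID(b"keynumber1keynumber1")`, produces the same 20 bytes directly.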

[jira] [Created] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2195:
-

 Summary: [Plasma] Segfault when retrieving RecordBatch from plasma 
store
 Key: ARROW-2195
 URL: https://issues.apache.org/jira/browse/ARROW-2195
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It can be reproduced with the following script:

```
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

```

 

Preliminary backtrace:

 

```

EXC_BAD_ACCESS (code=1, address=0x38158)

    frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
PyInt_FromLong

    0x10e645805 <+37>: testq  %rax, %rax

    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt

* thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

  * frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133

    frame #2: 0x00010e613b25 
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)





[jira] [Created] (ARROW-2194) Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2194:
-

 Summary: Pandas columns metadata incorrect for empty string columns
 Key: ARROW-2194
 URL: https://issues.apache.org/jira/browse/ARROW-2194
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Florian Jetter


The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
DataFrame is unexpectedly {{float64}}

 
{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import json

empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
np.array([], dtype=np.bytes_)})
empty_table = pa.Table.from_pandas(empty_df)
json.loads(empty_table.schema.metadata[b'pandas'])['columns']

# Same behavior for input dtype np.unicode_
[{u'field_name': u'bytes',
  u'metadata': None,
  u'name': u'bytes',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'unicode',
  u'metadata': None,
  u'name': u'unicode',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'__index_level_0__',
  u'metadata': None,
  u'name': None,
  u'numpy_type': u'int64',
  u'pandas_type': u'int64'}]{code}
 

Tested on Debian 8 with Python 2.7 and Python 3.6.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow sync today at 17:00 UTC

2018-02-21 Thread Jacques Nadeau
My notes from our sync-up.

*Attendees*
Wes: Discuss 0.9/1.0
Jacques: Nothing in particular.
Sidd: Nothing in particular.
Li: Binary format stability.
Simba: Discuss release
Uwe: 0.9 release, UUID type
Kevin: Nothing in particular.

*Format Stability*
The Interval type and unions need work/integration tests. Other types are
probably stable, but there are no guarantees until 1.0. The only guarantee we
have now is that a format change will cause an error (for example, the format
version changed in 0.8).
As part of this we need to verify whether Java is validating the binary format.
*Release 0.9*
Work towards getting a candidate up next week. Sidd “volunteered” to be
release manager. People will finish what they can before next week.

*Type Annotations*
Uwe will look to propose a structured way to annotate types for things
like UUID. We need to decide where annotations can be applied (lists of
structs, leaf values, etc.) and what validation the library will do. Wes
mentioned that the recent work in Parquet may be good inspiration.

*Needs for 1.0*
Finalize the Interval and Union types.
Java: implement the map type and integration tests.
C++: complete the map type implementation.
C++: implement fixed-size list.


On Wed, Feb 21, 2018 at 8:51 AM, Wes McKinney  wrote:

> https://meet.google.com/vtm-teks-phx
>


Java implementation of MapType

2018-02-21 Thread Bryan Cutler
I can't make the sync today, but I did want to ask if everyone is ok with
implementing the MapType in Java, which would add the MapVector class. Is
this something we can do now or do we want to wait until the next release
to give time for the previous rename to settle?  Thanks!


Re: Add a UUID type to the Arrow format

2018-02-21 Thread Wes McKinney
One possibility is adding type annotations sort of in the style of the
Parquet format. So these would run parallel to the types in
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

On Wed, Feb 21, 2018 at 11:53 AM, Jacques Nadeau  wrote:
> I think we should consider introducing these "business" types
> differently. The same could be said for a US zip code type, for example.
>
> On Thu, Feb 15, 2018 at 6:36 PM, Wes McKinney  wrote:
>
>> hi Uwe,
>>
>> This seems like a good idea to me given the widespread use of UUIDs,
>> and would make use more natural for application developers.
>>
>> - Wes
>>
>> On Tue, Feb 13, 2018 at 10:03 AM, Uwe L. Korn  wrote:
>> > Hello,
>> >
>> > I just opened https://issues.apache.org/jira/browse/ARROW-2152 to start
>> the discussion about adding a UUID type to the Arrow format specification.
>> In essence a UUID is simply a 128-bit array, but there are often special
>> classes used for it, e.g. java.util.UUID in Java and uuid.UUID in Python.
>> These provide special functions for them as well as sometimes the knowledge
>> that a column is a UUID could be beneficial during computations. Other data
>> systems like Postgres or Parquet also have a special UUID type.
>> >
>> > While there is only a small difference to a 128-bit fixed-size binary
>> array, I think providing the respective object model accessor is already a
>> good benefit.
>> >
>> > Uwe
>>



Arrow sync today at 17:00 UTC

2018-02-21 Thread Wes McKinney
https://meet.google.com/vtm-teks-phx



Re: FW: [jira] [Updated] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-21 Thread Atul Dambalkar
Thanks Wes.

Sent from my Samsung Galaxy smartphone.


 Original message 
From: Wes McKinney 
Date: 2/21/18 7:37 AM (GMT-08:00)
To: dev@arrow.apache.org
Subject: Re: FW: [jira] [Updated] (ARROW-1780) JDBC Adapter for Apache Arrow

hi Atul -- I added you to the contributor role in JIRA and assigned
the issue to you

On Tue, Feb 20, 2018 at 11:20 PM, Atul Dambalkar
 wrote:
> Hi Uwe,
>
> In terms of process, does this bug need to be assigned to me? I tried, but I 
> couldn't get it assigned to myself. Maybe you or someone from the Arrow team
> can do that?
>
> -Atul
> -Original Message-
> From: Uwe L. Korn (JIRA) [mailto:j...@apache.org]
> Sent: Tuesday, February 20, 2018 12:29 PM
> To: Atul Dambalkar 
> Subject: [jira] [Updated] (ARROW-1780) JDBC Adapter for Apache Arrow
>
>
>  [ 
> https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
>
> Uwe L. Korn updated ARROW-1780:
> ---
> Fix Version/s: 0.10.0
>
>> JDBC Adapter for Apache Arrow
>> -
>>
>> Key: ARROW-1780
>> URL: https://issues.apache.org/jira/browse/ARROW-1780
>> Project: Apache Arrow
>>  Issue Type: New Feature
>>Reporter: Atul Dambalkar
>>Priority: Major
>> Fix For: 0.10.0
>>
>>
>> At a high level the JDBC Adapter will allow upstream apps to query
>> RDBMS data over JDBC and get the JDBC objects converted to Arrow
>> objects/structures. The upstream utility can then work with Arrow
>> objects/structures with usual performance benefits. The utility will
>> be very much similar to C++ implementation of "Convert a vector of
>> row-wise data into an Arrow table" as described here -
>> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
>> The utility will read data from RDBMS and convert the data into Arrow
>> objects/structures, so from that perspective it will read data from RDBMS.
>> Whether the utility can also push Arrow objects to RDBMS needs to be
>> discussed and will be out of scope for this utility for now.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)




[jira] [Created] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2193:
-

 Summary: [Plasma] plasma_store forks endlessly
 Key: ARROW-2193
 URL: https://issues.apache.org/jira/browse/ARROW-2193
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Antoine Pitrou


I'm not sure why, but when I run the pyarrow test suite (for example {{py.test 
pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:

{code:bash}
 $ ps fuwww
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
[...]
antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
/home/antoine/miniconda3/envs/pyarrow/bin/python 
/home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 
1
antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
/home/antoine/miniconda3/envs/pyarrow/bin/python 
/home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 
1
antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
/home/antoine/miniconda3/envs/pyarrow/bin/python 
/home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 
1
antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
/home/antoine/miniconda3/envs/pyarrow/bin/python 
/home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 
1
[etc.]
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


R: [Python] Retrieving a RecordBatch from plasma inside a function

2018-02-21 Thread ALBERTO Bocchinfuso
Hi,

Have you had any news on this issue?
Do you plan to solve it for the next releases of Arrow, or is there any way to 
avoid the problem?

Thanks in advance,
Alberto
Da: Philipp Moritz
Inviato: venerdì 9 febbraio 2018 00:30
A: dev@arrow.apache.org
Oggetto: Re: [Python] Retrieving a RecordBatch from plasma inside a function

Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
plan to look into it on the weekend.

Here is the preliminary backtrace for everybody interested:

EXC_BAD_ACCESS (code=1, address=0x38158)

frame #0: 0x00010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

0x10e645800 <+32>: callq  0x10e698170   ; symbol stub for:
PyInt_FromLong

0x10e645805 <+37>: testq  %rax, %rax

0x10e645808 <+40>: je 0x10e64580c   ; <+44>

(lldb) bt

* thread #1: tid = 0xf1378e, 0x00010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28,
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
address=0x38158)

  * frame #0: 0x00010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*)
+ 133

frame #2: 0x00010e613b25
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx +
22305

On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Hi,
>
> I’m using python 3.5.2 and pyarrow 0.8.0
>
> As key, I put a string of 20 bytes, of course. I’m doing it differently
> from the canonical way since I’m no more using python 2.7, but python 3,
> and this seemed to me to be the right way to create a string of 20 bytes.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
>  client = plasma.connect('test', "", 0)
>
>  key = "keynumber1keynumber1"
>  pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
>  [buff] = client .get_buffers([pid])
>  batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
>  print(batch)
>  print(batch.schema)
>  print(batch[0])
>
>  return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> Da: Philipp Moritz
> Inviato: martedì 6 febbraio 2018 00:00
> A: dev@arrow.apache.org
> Oggetto: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using? On my
> installation, I get
>
> ```
>
> In [*5*]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> 
> ---
>
> ValueErrorTraceback (most recent call last)
>
>  in ()
>
> > 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be plasma.ObjectID(b
> "keynumber1keynumber1"))
>
> Best,
> Philipp.
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
> > Good morning,
> >
> > I am experiencing problems with the RecordBatches stored in plasma in a
> > particular situation.
> >
> > If I return a RecordBatch as result of a python function, I am able to
> > read just the metadata, while I get an error when reading the columns.
> >
> > For example, the following code
> > def retrieve1():
> > client = plasma.connect('test', "", 0)
> >
> > key = "keynumber1keynumber1"
> > pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >
> > [buff] = client .get_buffers([pid])
> > batch =