[jira] [Created] (ARROW-2595) [Plasma] operator[] creates entries in map

2018-05-16 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2595:
-

 Summary: [Plasma] operator[] creates entries in map
 Key: ARROW-2595
 URL: https://issues.apache.org/jira/browse/ARROW-2595
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz


* Problem

 ** Using object_get_requests_[object_id] in PlasmaStore::return_from_get creates map entries as a side effect (operator[] default-constructs a value for a missing key) and produces a lot of garbage data. During profiling we observed significant memory growth at this point.
 * Solution

 ** Use an iterator lookup (find) instead of operator[]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
You're welcome!

On Wed, May 16, 2018 at 6:13 PM Corey Nolet  wrote:

> I must say, I’m super excited about using Arrow and Plasma.
>
> The code you just posted worked for me at home and I’m sure I’ll figure
> out what I was doing wrong tomorrow at work.
>
> Anyways, thanks so much for your help and fast replies!
>
> Sent from my iPhone
>
> > On May 16, 2018, at 7:42 PM, Robert Nishihara 
> wrote:
> >
> > You should be able to do something like the following.
> >
> > # Start the store.
> > plasma_store -s /tmp/store -m 10
> >
> > Then in Python, do the following:
> >
> > import pandas as pd
> > import pyarrow.plasma as plasma
> > import numpy as np
> >
> > client = plasma.connect('/tmp/store', '', 0)
> > series = pd.Series(np.zeros(100))
> > object_id = client.put(series)
> >
> > And yes, I would create a separate Plasma client for each process. I
> don't
> > think you'll be able to pickle a Plasma client object successfully (it
> has
> > a socket connection to the store).
> >
> > On Wed, May 16, 2018 at 3:43 PM Corey Nolet  wrote:
> >
> >> Robert,
> >>
> >> Thank you for the quick response. I've been playing around for a few
> hours
> >> to get a feel for how this works.
> >>
> >> If I understand correctly, it's better to have the Plasma client objects
> >> instantiated within each separate process? Weird things seemed to happen
> >> when I attempted to share a single one. I was assuming that the pickle
> >> serialization by python multiprocessing would have been serializing the
> >> connection info and re-instantiating on the other side but that didn't
> seem
> >> to be the case.
> >>
> >> I managed to load up a gigantic set of CSV files into Dataframes. Now
> I'm
> >> attempting to read the chunks, perform a groupby-aggregate, and write
> the
> >> results back to the Plasma store. Unless I'm mistaken, there doesn't
> seem
> >> to be a very direct way of accomplishing this. When I tried converting
> the
> >> Series object into a Plasma Array and just doing a client.put(array) I
> get
> >> a pickling error. Unless maybe I'm misunderstanding the architecture
> here,
> >> I believe that error would have been referring to attempts to serialize
> the
> >> object into a file? I would hope that the data isn't all being sent to
> the
> >> single Plasma server (or sent over sockets for that matter).
> >>
> >> What would be the recommended strategy for serializing Pandas Series
> >> objects? I really like the StreamWriter concept here but there does not
> >> seem to be a direct way (or documentation) to accomplish this.
> >>
> >> Thanks again.
> >>
> >> On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara <
> >> robertnishih...@gmail.com
> >>> wrote:
> >>
> >>> Take a look at the Plasma object store
> >>> https://arrow.apache.org/docs/python/plasma.html.
> >>>
> >>> Here's an example using it (along with multiprocessing to sort a pandas
> >>> dataframe)
> >>> https://github.com/apache/arrow/blob/master/python/
> >>> examples/plasma/sorting/sort_df.py.
> >>> It's possible the example is a bit out of date.
> >>>
> >>> You may be interested in taking a look at Ray
> >>> https://github.com/ray-project/ray. We use Plasma/Arrow under the hood
> >> to
> >>> do all of these things but hide a lot of the bookkeeping (like object
> ID
> >>> generation). For your setting, you can think of it as a replacement for
> >>> Python multiprocessing that automatically uses shared memory and Arrow
> >> for
> >>> serialization.
> >>>
>  On Wed, May 16, 2018 at 10:02 AM Corey Nolet 
> wrote:
> 
>  I've been reading through the PyArrow documentation and trying to
>  understand how to use the tool effectively for IPC (using zero-copy).
> 
>  I'm on a system with 586 cores & 1TB of ram. I'm using Panda's
> >> Dataframes
>  to process several 10's of gigs of data in memory and the pickling
> that
> >>> is
>  done by Python's multiprocessing API is very wasteful.
> 
>  I'm running a little hand-built map-reduce where I chunk the dataframe
> >>> into
>  N_mappers number of chunks, run some processing on them, then run some
>  number N_reducers to finalize the operation. What I'd like to be able
> >> to
> >>> do
>  is chunk up the dataframe into Arrow Buffer objects and just have each
>  mapped task read their respective Buffer object with the guarantee of
>  zero-copy.
> 
>  I see there's a couple Filesystem abstractions for doing memory-mapped
>  files. Durability isn't something I need and I'm willing to forego the
>  expense of putting the files on disk.
> 
>  Is it possible to write the data directly to memory and pass just the
>  reference around to the different processes? What's the recommended
> way
> >>> to
>  accomplish my goal here?
> 
> 
>  Thanks in advance!
> 
> >>>
> >>
>


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
I must say, I’m super excited about using Arrow and Plasma.

The code you just posted worked for me at home and I’m sure I’ll figure out 
what I was doing wrong tomorrow at work. 

Anyways, thanks so much for your help and fast replies! 

Sent from my iPhone

> On May 16, 2018, at 7:42 PM, Robert Nishihara  
> wrote:
> 
> You should be able to do something like the following.
> 
> # Start the store.
> plasma_store -s /tmp/store -m 10
> 
> Then in Python, do the following:
> 
> import pandas as pd
> import pyarrow.plasma as plasma
> import numpy as np
> 
> client = plasma.connect('/tmp/store', '', 0)
> series = pd.Series(np.zeros(100))
> object_id = client.put(series)
> 
> And yes, I would create a separate Plasma client for each process. I don't
> think you'll be able to pickle a Plasma client object successfully (it has
> a socket connection to the store).
> 
> On Wed, May 16, 2018 at 3:43 PM Corey Nolet  wrote:
> 
>> Robert,
>> 
>> Thank you for the quick response. I've been playing around for a few hours
>> to get a feel for how this works.
>> 
>> If I understand correctly, it's better to have the Plasma client objects
>> instantiated within each separate process? Weird things seemed to happen
>> when I attempted to share a single one. I was assuming that the pickle
>> serialization by python multiprocessing would have been serializing the
>> connection info and re-instantiating on the other side but that didn't seem
>> to be the case.
>> 
>> I managed to load up a gigantic set of CSV files into Dataframes. Now I'm
>> attempting to read the chunks, perform a groupby-aggregate, and write the
>> results back to the Plasma store. Unless I'm mistaken, there doesn't seem
>> to be a very direct way of accomplishing this. When I tried converting the
>> Series object into a Plasma Array and just doing a client.put(array) I get
>> a pickling error. Unless maybe I'm misunderstanding the architecture here,
>> I believe that error would have been referring to attempts to serialize the
>> object into a file? I would hope that the data isn't all being sent to the
>> single Plasma server (or sent over sockets for that matter).
>> 
>> What would be the recommended strategy for serializing Pandas Series
>> objects? I really like the StreamWriter concept here but there does not
>> seem to be a direct way (or documentation) to accomplish this.
>> 
>> Thanks again.
>> 
>> On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara <
>> robertnishih...@gmail.com
>>> wrote:
>> 
>>> Take a look at the Plasma object store
>>> https://arrow.apache.org/docs/python/plasma.html.
>>> 
>>> Here's an example using it (along with multiprocessing to sort a pandas
>>> dataframe)
>>> https://github.com/apache/arrow/blob/master/python/
>>> examples/plasma/sorting/sort_df.py.
>>> It's possible the example is a bit out of date.
>>> 
>>> You may be interested in taking a look at Ray
>>> https://github.com/ray-project/ray. We use Plasma/Arrow under the hood
>> to
>>> do all of these things but hide a lot of the bookkeeping (like object ID
>>> generation). For your setting, you can think of it as a replacement for
>>> Python multiprocessing that automatically uses shared memory and Arrow
>> for
>>> serialization.
>>> 
 On Wed, May 16, 2018 at 10:02 AM Corey Nolet  wrote:
 
 I've been reading through the PyArrow documentation and trying to
 understand how to use the tool effectively for IPC (using zero-copy).
 
 I'm on a system with 586 cores & 1TB of ram. I'm using Panda's
>> Dataframes
 to process several 10's of gigs of data in memory and the pickling that
>>> is
 done by Python's multiprocessing API is very wasteful.
 
 I'm running a little hand-built map-reduce where I chunk the dataframe
>>> into
 N_mappers number of chunks, run some processing on them, then run some
 number N_reducers to finalize the operation. What I'd like to be able
>> to
>>> do
 is chunk up the dataframe into Arrow Buffer objects and just have each
 mapped task read their respective Buffer object with the guarantee of
 zero-copy.
 
 I see there's a couple Filesystem abstractions for doing memory-mapped
 files. Durability isn't something I need and I'm willing to forego the
 expense of putting the files on disk.
 
 Is it possible to write the data directly to memory and pass just the
 reference around to the different processes? What's the recommended way
>>> to
 accomplish my goal here?
 
 
 Thanks in advance!
 
>>> 
>> 


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
You should be able to do something like the following.

# Start the store.
plasma_store -s /tmp/store -m 10

Then in Python, do the following:

import pandas as pd
import pyarrow.plasma as plasma
import numpy as np

client = plasma.connect('/tmp/store', '', 0)
series = pd.Series(np.zeros(100))
object_id = client.put(series)

And yes, I would create a separate Plasma client for each process. I don't
think you'll be able to pickle a Plasma client object successfully (it has
a socket connection to the store).
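
For completeness, a minimal sketch of the read side in a separate worker process, assuming the store above is running, object_id came from client.put(series), and the same connect() arguments as in this thread:

```python
# Sketch: fetch the Series from a worker process (assumes the plasma_store
# above is running and object_id was produced by client.put(series)).
from multiprocessing import Process

import pyarrow.plasma as plasma


def worker(object_id_bytes):
    # Each process opens its own connection to the store.
    client = plasma.connect('/tmp/store', '', 0)
    object_id = plasma.ObjectID(object_id_bytes)
    series = client.get(object_id)  # deserializes the pandas Series
    print(series.sum())


# Pass the raw 20 bytes of the ID rather than pickling the client itself.
p = Process(target=worker, args=(object_id.binary(),))
p.start()
p.join()
```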

On Wed, May 16, 2018 at 3:43 PM Corey Nolet  wrote:

> Robert,
>
> Thank you for the quick response. I've been playing around for a few hours
> to get a feel for how this works.
>
> If I understand correctly, it's better to have the Plasma client objects
> instantiated within each separate process? Weird things seemed to happen
> when I attempted to share a single one. I was assuming that the pickle
> serialization by python multiprocessing would have been serializing the
> connection info and re-instantiating on the other side but that didn't seem
> to be the case.
>
> I managed to load up a gigantic set of CSV files into Dataframes. Now I'm
> attempting to read the chunks, perform a groupby-aggregate, and write the
> results back to the Plasma store. Unless I'm mistaken, there doesn't seem
> to be a very direct way of accomplishing this. When I tried converting the
> Series object into a Plasma Array and just doing a client.put(array) I get
> a pickling error. Unless maybe I'm misunderstanding the architecture here,
> I believe that error would have been referring to attempts to serialize the
> object into a file? I would hope that the data isn't all being sent to the
> single Plasma server (or sent over sockets for that matter).
>
> What would be the recommended strategy for serializing Pandas Series
> objects? I really like the StreamWriter concept here but there does not
> seem to be a direct way (or documentation) to accomplish this.
>
> Thanks again.
>
> On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara <
> robertnishih...@gmail.com
> > wrote:
>
> > Take a look at the Plasma object store
> > https://arrow.apache.org/docs/python/plasma.html.
> >
> > Here's an example using it (along with multiprocessing to sort a pandas
> > dataframe)
> > https://github.com/apache/arrow/blob/master/python/
> > examples/plasma/sorting/sort_df.py.
> > It's possible the example is a bit out of date.
> >
> > You may be interested in taking a look at Ray
> > https://github.com/ray-project/ray. We use Plasma/Arrow under the hood
> to
> > do all of these things but hide a lot of the bookkeeping (like object ID
> > generation). For your setting, you can think of it as a replacement for
> > Python multiprocessing that automatically uses shared memory and Arrow
> for
> > serialization.
> >
> > On Wed, May 16, 2018 at 10:02 AM Corey Nolet  wrote:
> >
> > > I've been reading through the PyArrow documentation and trying to
> > > understand how to use the tool effectively for IPC (using zero-copy).
> > >
> > > I'm on a system with 586 cores & 1TB of ram. I'm using Panda's
> Dataframes
> > > to process several 10's of gigs of data in memory and the pickling that
> > is
> > > done by Python's multiprocessing API is very wasteful.
> > >
> > > I'm running a little hand-built map-reduce where I chunk the dataframe
> > into
> > > N_mappers number of chunks, run some processing on them, then run some
> > > number N_reducers to finalize the operation. What I'd like to be able
> to
> > do
> > > is chunk up the dataframe into Arrow Buffer objects and just have each
> > > mapped task read their respective Buffer object with the guarantee of
> > > zero-copy.
> > >
> > > I see there's a couple Filesystem abstractions for doing memory-mapped
> > > files. Durability isn't something I need and I'm willing to forego the
> > > expense of putting the files on disk.
> > >
> > > Is it possible to write the data directly to memory and pass just the
> > > reference around to the different processes? What's the recommended way
> > to
> > > accomplish my goal here?
> > >
> > >
> > > Thanks in advance!
> > >
> >
>


[jira] [Created] (ARROW-2594) Vector reallocation does not properly clear reused buffers

2018-05-16 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-2594:
---

 Summary: Vector reallocation does not properly clear reused buffers
 Key: ARROW-2594
 URL: https://issues.apache.org/jira/browse/ARROW-2594
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Bryan Cutler
Assignee: Bryan Cutler


When reallocating a vector buffer, the code assumes that the first half of the new 
buffer is clean or was populated from the previous buffer, and it only zeros out the 
second half.  This is not the case if the vector has released the buffer and the 
current capacity is 0 (empty).  If the new buffer already has bytes set, they 
will show up as bogus values when the vector is used.

I came across this when looking into SPARK-23030, due to the comment here 
https://github.com/apache/spark/pull/21312#issuecomment-389035697



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
Robert,

Thank you for the quick response. I've been playing around for a few hours
to get a feel for how this works.

If I understand correctly, it's better to have the Plasma client objects
instantiated within each separate process? Weird things seemed to happen
when I attempted to share a single one. I was assuming that the pickle
serialization by python multiprocessing would have been serializing the
connection info and re-instantiating on the other side but that didn't seem
to be the case.

I managed to load up a gigantic set of CSV files into Dataframes. Now I'm
attempting to read the chunks, perform a groupby-aggregate, and write the
results back to the Plasma store. Unless I'm mistaken, there doesn't seem
to be a very direct way of accomplishing this. When I tried converting the
Series object into a Plasma Array and just doing a client.put(array) I get
a pickling error. Unless maybe I'm misunderstanding the architecture here,
I believe that error would have been referring to attempts to serialize the
object into a file? I would hope that the data isn't all being sent to the
single Plasma server (or sent over sockets for that matter).

What would be the recommended strategy for serializing Pandas Series
objects? I really like the StreamWriter concept here but there does not
seem to be a direct way (or documentation) to accomplish this.

Thanks again.
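
One way to do the stream-writer route, sketched along the lines of the pattern in the pyarrow Plasma docs (the object ID and store path below are illustrative, and the simpler path is just client.put(series), which serializes via pyarrow):

```python
# Sketch: write a pandas Series into Plasma with the record batch stream
# writer (illustrative object ID; assumes the store from this thread is running).
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.plasma as plasma

client = plasma.connect('/tmp/store', '', 0)

series = pd.Series(np.arange(1000), name='values')
batch = pa.RecordBatch.from_pandas(series.to_frame())

# Measure the serialized size with a mock sink, then write into a
# Plasma-allocated buffer of exactly that size and seal it.
mock_sink = pa.MockOutputStream()
writer = pa.RecordBatchStreamWriter(mock_sink, batch.schema)
writer.write_batch(batch)
writer.close()

object_id = plasma.ObjectID(np.random.bytes(20))
buf = client.create(object_id, mock_sink.size())
sink = pa.FixedSizeBufferWriter(buf)
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
client.seal(object_id)
```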

On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara  wrote:

> Take a look at the Plasma object store
> https://arrow.apache.org/docs/python/plasma.html.
>
> Here's an example using it (along with multiprocessing to sort a pandas
> dataframe)
> https://github.com/apache/arrow/blob/master/python/
> examples/plasma/sorting/sort_df.py.
> It's possible the example is a bit out of date.
>
> You may be interested in taking a look at Ray
> https://github.com/ray-project/ray. We use Plasma/Arrow under the hood to
> do all of these things but hide a lot of the bookkeeping (like object ID
> generation). For your setting, you can think of it as a replacement for
> Python multiprocessing that automatically uses shared memory and Arrow for
> serialization.
>
> On Wed, May 16, 2018 at 10:02 AM Corey Nolet  wrote:
>
> > I've been reading through the PyArrow documentation and trying to
> > understand how to use the tool effectively for IPC (using zero-copy).
> >
> > I'm on a system with 586 cores & 1TB of ram. I'm using Panda's Dataframes
> > to process several 10's of gigs of data in memory and the pickling that
> is
> > done by Python's multiprocessing API is very wasteful.
> >
> > I'm running a little hand-built map-reduce where I chunk the dataframe
> into
> > N_mappers number of chunks, run some processing on them, then run some
> > number N_reducers to finalize the operation. What I'd like to be able to
> do
> > is chunk up the dataframe into Arrow Buffer objects and just have each
> > mapped task read their respective Buffer object with the guarantee of
> > zero-copy.
> >
> > I see there's a couple Filesystem abstractions for doing memory-mapped
> > files. Durability isn't something I need and I'm willing to forego the
> > expense of putting the files on disk.
> >
> > Is it possible to write the data directly to memory and pass just the
> > reference around to the different processes? What's the recommended way
> to
> > accomplish my goal here?
> >
> >
> > Thanks in advance!
> >
>


[RESULT] [VOTE] Accept donation of Arrow Ruby bindings

2018-05-16 Thread Wes McKinney
With 5 binding +1 votes, the vote passes. Thanks all!

I will work with Kou to finish the IP clearance and merge the pull request.

On Wed, May 16, 2018 at 9:46 AM, P. Taylor Goetz  wrote:
> +1
>
> I’ve been through IP clearance a few times, and can help if needed.
>
> -Taylor
>
>> On May 11, 2018, at 6:47 PM, Wes McKinney  wrote:
>>
>> Dear all,
>>
>> Arrow PMC member Kouhei Sutou has developed Ruby bindings to the GLib
>> C interface for Apache Arrow
>>
>> * https://github.com/red-data-tools/red-arrow
>> * https://github.com/red-data-tools/red-arrow-gpu
>>
>> He is proposing to pull these projects into Apache Arrow to develop
>> them all in the same place
>>
>> https://github.com/apache/arrow/pull/1990
>>
>> We are proposing to accept this code into the Apache project. If the
>> vote passes, the PMC and Kou will work together to complete the ASF IP
>> Clearance process (http://incubator.apache.org/ip-clearance/) and
>> import the Ruby bindings for inclusion in a future release:
>>
>>[ ] +1 : Accept contribution of Ruby bindings
>>[ ]  0 : No opinion
>>[ ] -1 : Reject contribution because...
>>
>> Here is my vote: +1
>>
>> The vote will be open for at least 72 hours.
>>
>> Thanks,
>> Wes


[jira] [Created] (ARROW-2593) [Python] TypeError: data type "mixed-integer" not understood

2018-05-16 Thread Dima Ryazanov (JIRA)
Dima Ryazanov created ARROW-2593:


 Summary: [Python] TypeError: data type "mixed-integer" not 
understood
 Key: ARROW-2593
 URL: https://issues.apache.org/jira/browse/ARROW-2593
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Dima Ryazanov


PyArrow 0.9 raises an exception when converting some tables to pandas 
DataFrames. Earlier versions work fine. Repro steps:

{noformat}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame({'foo': [], 123: []})

In [4]: table = pa.Table.from_pandas(df)

In [5]: table.to_pandas()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in _pandas_type_to_numpy_type(pandas_type)
    666 try:
--> 667     return _pandas_logical_type_map[pandas_type]
    668 except KeyError:

KeyError: 'mixed-integer'
{noformat}

(I ended up with a dataframe with mixed string/integer columns by using 
pd.read_excel(..., skiprows=[0]) - which skipped the header, and treated the 
first line of data as column names.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2592) [Python] AssertionError in to_pandas()

2018-05-16 Thread Dima Ryazanov (JIRA)
Dima Ryazanov created ARROW-2592:


 Summary: [Python] AssertionError in to_pandas()
 Key: ARROW-2592
 URL: https://issues.apache.org/jira/browse/ARROW-2592
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0, 0.8.0
Reporter: Dima Ryazanov


PyArrow 0.8 and 0.9 raise an AssertionError for one of the datasets I have 
(created using an older version of pyarrow). Repro steps:


{noformat}
In [1]: from pyarrow.parquet import ParquetDataset

In [2]: d = ParquetDataset(['bug.parq'])

In [3]: t = d.read()

In [4]: t.to_pandas()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
 in ()
> 1 t.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categories)
    529 # There must be the same number of field names and physical names
    530 # (fields in the arrow Table)
--> 531 assert len(logical_index_names) == len(index_columns_set)
    532 
    533 # It can never be the case in a released version of pyarrow that

AssertionError: 
{noformat}

 

Here's the file: [https://www.dropbox.com/s/oja3khjsc5tycfh/bug.parq]

(I was not able to attach it here due to a "missing token", whatever that 
means.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
Take a look at the Plasma object store
https://arrow.apache.org/docs/python/plasma.html.

Here's an example using it (along with multiprocessing to sort a pandas
dataframe)
https://github.com/apache/arrow/blob/master/python/examples/plasma/sorting/sort_df.py.
It's possible the example is a bit out of date.

You may be interested in taking a look at Ray
https://github.com/ray-project/ray. We use Plasma/Arrow under the hood to
do all of these things but hide a lot of the bookkeeping (like object ID
generation). For your setting, you can think of it as a replacement for
Python multiprocessing that automatically uses shared memory and Arrow for
serialization.

On Wed, May 16, 2018 at 10:02 AM Corey Nolet  wrote:

> I've been reading through the PyArrow documentation and trying to
> understand how to use the tool effectively for IPC (using zero-copy).
>
> I'm on a system with 586 cores & 1TB of ram. I'm using Panda's Dataframes
> to process several 10's of gigs of data in memory and the pickling that is
> done by Python's multiprocessing API is very wasteful.
>
> I'm running a little hand-built map-reduce where I chunk the dataframe into
> N_mappers number of chunks, run some processing on them, then run some
> number N_reducers to finalize the operation. What I'd like to be able to do
> is chunk up the dataframe into Arrow Buffer objects and just have each
> mapped task read their respective Buffer object with the guarantee of
> zero-copy.
>
> I see there's a couple Filesystem abstractions for doing memory-mapped
> files. Durability isn't something I need and I'm willing to forego the
> expense of putting the files on disk.
>
> Is it possible to write the data directly to memory and pass just the
> reference around to the different processes? What's the recommended way to
> accomplish my goal here?
>
>
> Thanks in advance!
>


Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Phillip Cloud
Meeting notes from the call:

Attendees/Topics to discuss

- Wes
  - Packaging
- Uwe
  - Packaging
- Simba
- Li (Two Sigma)
- Ethan (Two Sigma)
- Josh (Two Sigma)
  - Exceptions vs status codes
  - Class design question
    - What data goes in parent/child classes?
  - Parquet arrow code location
- Phillip (Two Sigma)
- Krisztián
  - Packaging
- Aneesh

Packaging

- Questions
  - Where are the artifacts going?
  - BinTray
    - Each job posts an artifact to a specific project
    - Issues
      - Apache Bintray cannot be used
        - Possibly, but involve infra for encrypting secrets
      - Uploading artifacts from the apache org is challenging
      - Worst case: set up an Arrow PMC packaging project
  - Upload to GitHub directly?
    - Total size of all binaries? Issue there?
    - LFS 2 GB limit
      - Aneesh: very slow
  - Other options
    - Glacier?
    - Buckets?
  - Download all artifacts per release
    - Usable from GitHub
    - Make sure there's a way to list files
    - No-click workflow
  - Take crossbow for a spin
  - Nightly builds
    - Appveyor
    - Travis

Parent/Child Class Design Question

- Arrays/Types both have a children list
  - Potential errors around assumptions about lists of children for arrays that do not have children
- Non-nested types have an empty vector; not necessary, perf hit but probably not apparent?
- Nested type interface
  - Move the child data types to a nested class?
- ArrayData idea: POD for data

Parquet Arrow code

- Circular dependency not ideal
- Monorepo?
- Duplicated build system code
  - Build scripts in a separate repo
  - Potentially more work?
- Need to be careful about how we package scripts
- Shared library of C++ code for Apache projects?
  - Lots of build scripts, C++ code
  - Copy-pasted things in a few places
- Potentially pull out memory pool
- Mailing list thread about the circular dep issue and reducing build system dependency
- Impala has copy-pasted code from parquet-cpp apparently

Exceptions

- Status codes potentially not C++
- Uwe
  - Exceptions fail in weird ways with different compilers (not caught in some cases)
- Josh
  - Status codes vs exceptions
- Wes
  - Compromise on exceptions internally, and status codes in the public API
- Phillip
  - Tool for checking for potentially uncaught exceptions

JavaScript PR: https://github.com/apache/arrow/pull/2035

- JavaScript IPC writer
- Streaming/File format
- Needs help with alignment


On Wed, May 16, 2018 at 12:10 PM Wes McKinney  wrote:

> Here's a new Hangout:
> https://hangouts.google.com/call/RN8qAVjTdPwXmGZmMx7zAAEE. Let's talk
> there
>
> On Thu, May 17, 2018 at 1:07 AM, Krisztián Szűcs
>  wrote:
> > Same
> >
> > On May 16 2018, at 6:06 pm, Uwe L. Korn  wrote:
> >>
> >> On my side I'm waiting to someone to let me in...
> >> On Wed, May 16, 2018, at 6:05 PM, Wes McKinney wrote:
> >> > Google Meet says the meeting is full
> >> >
> >> > On Wed, May 16, 2018, 11:25 AM Alex Hagerman 
> wrote:
> >> > > Aneesh and I had some good conversations during the sprint at
> PyCon. Not
> >> > > sure if he will be on the call today to share, but I won’t be able
> to make
> >> > > it until the next call.
> >> > >
> >> > > Alex
> >> > > From: Wes McKinney
> >> > > Sent: Wednesday, May 16, 2018 11:00 AM
> >> > > To: dev@arrow.apache.org
> >> > > Subject: Arrow sync at 12pm EDT today
> >> > >
> >> > > See you at https://meet.google.com/vtm-teks-phx
>


PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
I've been reading through the PyArrow documentation and trying to
understand how to use the tool effectively for IPC (using zero-copy).

I'm on a system with 586 cores & 1 TB of RAM. I'm using pandas DataFrames
to process several tens of gigabytes of data in memory, and the pickling
done by Python's multiprocessing API is very wasteful.

I'm running a little hand-built map-reduce where I chunk the dataframe into
N_mappers chunks, run some processing on them, then run N_reducers tasks to
finalize the operation. What I'd like to be able to do is chunk up the
dataframe into Arrow Buffer objects and just have each mapped task read its
respective Buffer object with a guarantee of zero-copy.

I see there are a couple of Filesystem abstractions for doing memory-mapped
files. Durability isn't something I need, and I'm willing to forgo the
expense of putting the files on disk.

Is it possible to write the data directly to memory and pass just the
reference around to the different processes? What's the recommended way to
accomplish my goal here?


Thanks in advance!
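
A minimal sketch of the memory-mapped-file route described above, assuming a tmpfs location such as /dev/shm so nothing actually hits disk (the Plasma store suggested in the replies avoids managing files entirely):

```python
# Sketch: write one chunk as a record batch stream to a tmpfs file, then
# memory-map it in a worker and read the batches back zero-copy.
# The /dev/shm path is an assumption (Linux tmpfs); any path works.
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': np.random.randn(1000000)})
batch = pa.RecordBatch.from_pandas(df)

# Producer side: one file per chunk.
with pa.OSFile('/dev/shm/chunk0.arrow', 'wb') as sink:
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

# Mapped-task side: memory-map and read without copying the Arrow buffers.
source = pa.memory_map('/dev/shm/chunk0.arrow', 'r')
reader = pa.RecordBatchStreamReader(source)
table = reader.read_all()      # buffers reference the mapped memory
df_again = table.to_pandas()   # converting back to pandas does copy
```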


[jira] [Created] (ARROW-2591) [Python] Segmentationfault issue in pq.write_table

2018-05-16 Thread jacques (JIRA)
jacques created ARROW-2591:
--

 Summary: [Python] Segmentationfault issue in pq.write_table
 Key: ARROW-2591
 URL: https://issues.apache.org/jira/browse/ARROW-2591
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0, 0.8.0
Reporter: jacques


When trying this simple code snippet I got a segmentation fault:

 

{noformat}

In [1]: import pyarrow as pa

In [2]: import pyarrow.parquet as pq

In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))

In [4]: table = pa.Table.from_arrays([pa_ar],["test"])

In [5]: pq.write_table(
   ...: table=table,
   ...: where="test.parquet",
   ...: compression="snappy",
   ...: flavor="spark"
   ...: )
Segmentation fault

{noformat}

Could this please be fixed?

 

Best

Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Wes McKinney
Here's a new Hangout:
https://hangouts.google.com/call/RN8qAVjTdPwXmGZmMx7zAAEE. Let's talk
there

On Thu, May 17, 2018 at 1:07 AM, Krisztián Szűcs
 wrote:
> Same
>
> On May 16 2018, at 6:06 pm, Uwe L. Korn  wrote:
>>
>> On my side I'm waiting to someone to let me in...
>> On Wed, May 16, 2018, at 6:05 PM, Wes McKinney wrote:
>> > Google Meet says the meeting is full
>> >
>> > On Wed, May 16, 2018, 11:25 AM Alex Hagerman  
>> > wrote:
>> > > Aneesh and I had some good conversations during the sprint at PyCon. Not
>> > > sure if he will be on the call today to share, but I won’t be able to 
>> > > make
>> > > it until the next call.
>> > >
>> > > Alex
>> > > From: Wes McKinney
>> > > Sent: Wednesday, May 16, 2018 11:00 AM
>> > > To: dev@arrow.apache.org
>> > > Subject: Arrow sync at 12pm EDT today
>> > >
>> > > See you at https://meet.google.com/vtm-teks-phx


Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Krisztián Szűcs
Same

On May 16 2018, at 6:06 pm, Uwe L. Korn  wrote:
>
> On my side I'm waiting to someone to let me in...
> On Wed, May 16, 2018, at 6:05 PM, Wes McKinney wrote:
> > Google Meet says the meeting is full
> >
> > On Wed, May 16, 2018, 11:25 AM Alex Hagerman  wrote:
> > > Aneesh and I had some good conversations during the sprint at PyCon. Not
> > > sure if he will be on the call today to share, but I won’t be able to make
> > > it until the next call.
> > >
> > > Alex
> > > From: Wes McKinney
> > > Sent: Wednesday, May 16, 2018 11:00 AM
> > > To: dev@arrow.apache.org
> > > Subject: Arrow sync at 12pm EDT today
> > >
> > > See you at https://meet.google.com/vtm-teks-phx

Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Aneesh Karve
I'm stuck at "Asking to Join." Will move to better WiFi and try again.

On Wed, May 16, 2018 at 8:25 AM, Alex Hagerman 
wrote:

> Aneesh and I had some good conversations during the sprint at PyCon. Not
> sure if he will be on the call today to share, but I won’t be able to make
> it until the next call.
>
> Alex
>
> From: Wes McKinney
> Sent: Wednesday, May 16, 2018 11:00 AM
> To: dev@arrow.apache.org
> Subject: Arrow sync at 12pm EDT today
>
> See you at https://meet.google.com/vtm-teks-phx
>
>


-- 

Aneesh Karve | 765-360-9348 | LinkedIn  |
Twitter 



quiltdata.com | Manage data like code



Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Uwe L. Korn
On my side I'm waiting for someone to let me in...

On Wed, May 16, 2018, at 6:05 PM, Wes McKinney wrote:
> Google Meet says the meeting is full
> 
> On Wed, May 16, 2018, 11:25 AM Alex Hagerman  wrote:
> 
> > Aneesh and I had some good conversations during the sprint at PyCon. Not
> > sure if he will be on the call today to share, but I won’t be able to make
> > it until the next call.
> >
> > Alex
> >
> > From: Wes McKinney
> > Sent: Wednesday, May 16, 2018 11:00 AM
> > To: dev@arrow.apache.org
> > Subject: Arrow sync at 12pm EDT today
> >
> > See you at https://meet.google.com/vtm-teks-phx
> >
> >


Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Li Jin
I can't seem to get into the room. Anyone else in?

On Wed, May 16, 2018 at 12:05 PM, Wes McKinney  wrote:

> Google Meet says the meeting is full
>
> On Wed, May 16, 2018, 11:25 AM Alex Hagerman 
> wrote:
>
> > Aneesh and I had some good conversations during the sprint at PyCon. Not
> > sure if he will be on the call today to share, but I won’t be able to
> make
> > it until the next call.
> >
> > Alex
> >
> > From: Wes McKinney
> > Sent: Wednesday, May 16, 2018 11:00 AM
> > To: dev@arrow.apache.org
> > Subject: Arrow sync at 12pm EDT today
> >
> > See you at https://meet.google.com/vtm-teks-phx
> >
> >
>


Re: Arrow sync at 12pm EDT today

2018-05-16 Thread Wes McKinney
Google Meet says the meeting is full

On Wed, May 16, 2018, 11:25 AM Alex Hagerman  wrote:

> Aneesh and I had some good conversations during the sprint at PyCon. Not
> sure if he will be on the call today to share, but I won’t be able to make
> it until the next call.
>
> Alex
>
> From: Wes McKinney
> Sent: Wednesday, May 16, 2018 11:00 AM
> To: dev@arrow.apache.org
> Subject: Arrow sync at 12pm EDT today
>
> See you at https://meet.google.com/vtm-teks-phx
>
>


RE: Arrow sync at 12pm EDT today

2018-05-16 Thread Alex Hagerman
Aneesh and I had some good conversations during the sprint at PyCon. Not sure 
if he will be on the call today to share, but I won’t be able to make it until 
the next call.

Alex

From: Wes McKinney
Sent: Wednesday, May 16, 2018 11:00 AM
To: dev@arrow.apache.org
Subject: Arrow sync at 12pm EDT today

See you at https://meet.google.com/vtm-teks-phx



Arrow sync at 12pm EDT today

2018-05-16 Thread Wes McKinney
See you at https://meet.google.com/vtm-teks-phx


[jira] [Created] (ARROW-2590) Pyspark python_udf serialization error on grouped map (Amazon EMR)

2018-05-16 Thread Daniel Fithian (JIRA)
Daniel Fithian created ARROW-2590:
-

 Summary: Pyspark python_udf serialization error on grouped map 
(Amazon EMR)
 Key: ARROW-2590
 URL: https://issues.apache.org/jira/browse/ARROW-2590
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Amazon EMR 5.13
Spark 2.3.0
PyArrow 0.9.0 (and 0.8.0)
Pandas 0.22.0 (and 0.21.1)
Numpy 1.14.1
Reporter: Daniel Fithian


I am writing a python_udf grouped map aggregation on Spark 2.3.0 in Amazon EMR. 
When I try to run any aggregation, I get the following Python stack trace:

{noformat}
18/05/16 14:08:56 ERROR Utils: Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/serializers.py", line 261, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/serializers.py", line 239, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/serializers.py", line 239, in 
    arrs = [create_array(s, t) for s, t in series]
  File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_02/pyspark.zip/pyspark/serializers.py", line 237, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "array.pxi", line 372, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "array.pxi", line 77, in pyarrow.lib._ndarray_to_array
  File "error.pxi", line 98, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: 'utf-32-le' codec can't decode 
bytes in position 0-3: code point not in range(0x11)
{noformat}

To be clear, this happens when I run any aggregation, including the identity 
aggregation (return the Pandas DataFrame that was passed in). I do not get this 
error when I return an empty DataFrame, so it seems to be a symptom of the 
serialization of the Pandas DataFrame back to Spark.
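
For reference, a minimal sketch of the identity grouped-map aggregation described above (the data and column names are hypothetical); on the affected setup this reportedly fails while Arrow-serializing the returned pandas DataFrame:

```python
# Sketch of the failing identity grouped-map aggregation (hypothetical data).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (1, 'b'), (2, 'c')], ['id', 'value'])

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def identity(pdf):
    # Return the pandas DataFrame unchanged.
    return pdf

df.groupby('id').apply(identity).show()
```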

I have observed this behavior with the following versions:
 * Spark 2.3.0
 * PyArrow 0.9.0 (also 0.8.0)
 * Pandas 0.22.0 (also 0.22.1)
 * Numpy 1.14.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2589) [Python] test_parquet.py regression with Pandas 0.23.0

2018-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2589:
-

 Summary: [Python] test_parquet.py regression with Pandas 0.23.0
 Key: ARROW-2589
 URL: https://issues.apache.org/jira/browse/ARROW-2589
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


See e.g. https://travis-ci.org/apache/arrow/jobs/379652352#L3124.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2588) [Plasma] Random unique ids always use the same seed

2018-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2588:
-

 Summary: [Plasma] Random unique ids always use the same seed
 Key: ARROW-2588
 URL: https://issues.apache.org/jira/browse/ARROW-2588
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Antoine Pitrou


Following GitHub PR #2039 (resolution to ARROW-2578), the random generator for 
random object ids is now using a constant default seed, meaning all processes 
will generate the same sequence of random ids:
{code:java}
$ python -c "from pyarrow import plasma; print(plasma.ObjectID.from_random())"
ObjectID(d022e7d520f8e938a14e188c47308cfef5fff7f7)
$ python -c "from pyarrow import plasma; print(plasma.ObjectID.from_random())"
ObjectID(d022e7d520f8e938a14e188c47308cfef5fff7f7)
{code}
As a sidenote, the plasma test suite should ideally test for this.
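
A possible per-process workaround until the seeding is fixed is to build IDs from os.urandom rather than from_random(); a sketch:

```python
# Sketch: generate ObjectIDs from OS entropy instead of the
# (constant-seeded) from_random() helper.
import os
import pyarrow.plasma as plasma

def random_object_id():
    # ObjectIDs are 20 raw bytes; os.urandom is independent per process.
    return plasma.ObjectID(os.urandom(20))

print(random_object_id())
print(random_object_id())
```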



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2587) [Python] can read StructArrays from parquet but unable to write them

2018-05-16 Thread jacques (JIRA)
jacques created ARROW-2587:
--

 Summary: [Python] can read StructArrays from parquet but unable to 
write them
 Key: ARROW-2587
 URL: https://issues.apache.org/jira/browse/ARROW-2587
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: jacques


Although I am able to read StructArray from parquet, I am still unable to write 
it back from pa.Table to parquet.

Here is a quick example

```python

In [2]: import pyarrow.parquet as pq

In [3]: table = pq.read_table('test.parquet')

In [4]: table
Out[4]: 
pyarrow.Table
weight: double
animal_type: string
animal_interpretation: struct
  child 0, is_large_animal: bool
  child 1, is_mammal: bool
metadata

{'org.apache.spark.sql.parquet.row.metadata': 
'\{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},\{"name":"animal_type","type":"string","nullable":true,"metadata":{}},\{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},\{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}

In [5]: table.schema
Out[5]: 
weight: double
animal_type: string
animal_interpretation: struct
  child 0, is_large_animal: bool
  child 1, is_mammal: bool
metadata

{'org.apache.spark.sql.parquet.row.metadata': 
'\{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},\{"name":"animal_type","type":"string","nullable":true,"metadata":{}},\{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},\{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}

In [6]: pq.write_table(table,"test_write.parquet")
---
ArrowInvalid  Traceback (most recent call last)
 in ()
> 1 pq.write_table(table,"test_write.parquet")

/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in 
write_table(table, where, row_group_size, version, use_dictionary, compression, 
use_deprecated_int96_timestamps, coerce_timestamps, flavor, **kwargs)
    982 use_deprecated_int96_timestamps=use_int96,
    983 **kwargs) as writer:
--> 984 writer.write_table(table, row_group_size=row_group_size)
    985 except Exception:
    986 if is_path(where):

/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in write_table(self, 
table, row_group_size)
    325 table = _sanitize_table(table, self.schema, self.flavor)
    326 assert self.is_open
--> 327 self.writer.write_table(table, row_group_size=row_group_size)
    328 
    329 def close(self):

/usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so in 
pyarrow._parquet.ParquetWriter.write_table()

/usr/local/lib/python2.7/dist-packages/pyarrow/lib.so in 
pyarrow.lib.check_status()

ArrowInvalid: Nested column branch had multiple children

```

 

I would really appreciate a fix on this.

Best,

Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)