Re: [Celebrate] Arrow has reached 2000 stargeezers

2018-05-28 Thread simba nyatsanga
Congratulations everyone!

On Mon, 28 May 2018 at 21:42 Li Jin  wrote:

> Congrats everyone!
> On Mon, May 28, 2018 at 3:21 PM Jacques Nadeau  wrote:
>
> > Woo!
> >
> > On Mon, May 28, 2018 at 4:50 PM, Wes McKinney 
> wrote:
> >
> > > Congrats all! The journey continues
> > >
> > > On Mon, May 28, 2018 at 9:43 AM, Krisztián Szűcs
> > >  wrote:
> > > > Which makes Arrow the 33rd most starred Apache repository (out of
> 1555,
> > > according to github).
> > > >
> > > > Congratulations!
> > >
> >
>


Memory mapping error on pq.read_table

2018-02-08 Thread simba nyatsanga
Hi Everyone,

I've encountered a memory mapping error when attempting to read a parquet
file into a Pandas DataFrame. It seems to happen intermittently; so far I've
encountered it once. In my case the pq.read_table code is being invoked in a
Linux docker container. I had a look at the docs for PyArrow memory and IO
management here:
https://arrow.apache.org/docs/python/memory.html

What could give rise to the stacktrace below?

File "read_file.py", line 173, in load_chunked_data return
pq.read_table(data_obj_path, columns=columns).to_pandas()File
"/opt/anaconda-python-5.0.1/lib/python2.7/site-packages/pyarrow/parquet.py",
line 890, in read_table pf = ParquetFile (source,
metadata=metadata)File
"/opt/anaconda-python-5.0.1/lib/python2.7/site-packages/pyarrow/parquet.py",
line 56, in __init__ self.reader.open(source, metadata=metadata)File
"pyarrow/_parquet.pyx", line 624, in
pyarrow._parquet.ParquetReader.open
(/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:11558)
get_reader(source, _handle)File "pyarrow/io.pxi", line 798, in
pyarrow.lib.get_reader
(/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:58504) source =
memory_map(source, mode='r')File "pyarrow/io.pxi", line 473, in
pyarrow.lib.memory_map
(/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:54834)
mmap._open(path, mode)File "pyarrow/io.pxi", line 452, in
pyarrow.lib.MemoryMappedFile ._open
(/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:54613)
check_status(CMemoryMappedFile .Open(c_path, c_mode, ))File
"pyarrow/error.pxi", line 79, in pyarrow.lib.check_status
(/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) raise
ArrowIOError(message) ArrowIOError: Memory mapping file failed, errno:
22
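
In case it helps narrow things down, here is a minimal sketch of the same
read done through an explicitly opened file instead of the memory-mapped
path (the path and column list are placeholders for the ones used above):

import pyarrow as pa
import pyarrow.parquet as pq

data_obj_path = "/data/chunk_0.parquet"  # placeholder for the real path
columns = None                           # or the real list of column names

# Passing an opened file skips the memory_map() call seen in the traceback;
# if this succeeds while the path-based call fails, the problem is likely in
# the mmap step rather than in the file contents themselves.
source = pa.OSFile(data_obj_path, 'rb')
try:
    df = pq.read_table(source, columns=columns).to_pandas()
finally:
    source.close()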



Thanks for the help.

Kind Regards
Simba


Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-30 Thread simba nyatsanga
Hi Everyone,

Just an update on the above questions. I've updated the numbers in the Google
sheet using data with less entropy here:
https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0

I've also included the benchmarking code. Although some of the data examples
might be small by web-scale standards, these small sizes represent a
significant portion (though not all) of the computation inputs I am dealing
with, so I felt it necessary to include benchmarks against them.
Additionally, due to the way the data is collected, low cardinality is
guaranteed, but repetition is not necessarily guaranteed in most of the
columns of the data.

All the data sets and benchmarking code are reproducible here:

https://github.com/simnyatsanga/python-notebooks/blob/master/arrow_parquet_benchmark.ipynb
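
For anyone who doesn't want to open the notebook, the core of the comparison
boils down to something like the following (a simplified sketch; the real
benchmark sweeps many more shapes and also measures the blosc-per-column
baseline, and brotli support depends on how pyarrow was built):

import os
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Low-cardinality float data, similar in spirit to the benchmark inputs
# (values drawn from 1-9, cast to float).
df = pd.DataFrame(
    np.random.randint(1, 10, size=(1000000, 10)).astype(float),
    columns=["c%d" % i for i in range(10)])
table = pa.Table.from_pandas(df)

for codec in ("snappy", "brotli"):
    path = "bench_%s.parquet" % codec
    pq.write_table(table, path, compression=codec)
    print("%s: %d bytes" % (codec, os.path.getsize(path)))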

Hopefully this adds more clarity to the questions.

Kind Regards
Simba



On Thu, 25 Jan 2018 at 15:37 simba nyatsanga <simnyatsa...@gmail.com> wrote:

> Thanks all for the great feedback!
>
> Thanks Daniel for the sample data sets. I loaded them up and they're quite
> comparable in size to some of the data I'm dealing with. In my case the
> shapes range from 150  to ~100million rows. Column wise they range from 2-3
> columns to ~500,000 columns.
>
> Thanks Wes for the insight regarding the inverse proportion between
> entropy and Parquet's performance. I'm glad I understand why my
> benchmarking set would have skewed the results. The data sets I'm dealing
> with will be fairly random, but have very low cardinality. In that
> benchmark I used values that range from 1 to 9. So if I understand you
> correctly repetitiveness is key for Parquet's performance as opposed to
> cardinality (even though the lower the cardinality the more likely I am to
> have repeated values because of the small number of possibilities).
>
> Thanks Ted for the insight as well. Can I get some clarification when you
> said *You **also have a very small number of rows which can penalize the
> system that expects **to amortize column meta data over more data. *If I
> understand you correctly are you saying there's a column metadata overhead
> and this overhead is amortized or "paid off" when I have a large amount of
> data. If that's the the case, is the said amortization also applicable in
> the case where I used 1million rows?
>
> Kind Regards
> Simba
>
> On Wed, 24 Jan 2018 at 21:30 Daniel Lemire <lem...@gmail.com> wrote:
>
>> Here are some realistic tabular data sets...
>>
>> https://github.com/lemire/RealisticTabularDataSets
>>
>> They are small by modern standards but they are also one GitHub clone
>> away.
>>
>> - Daniel
>>
>> On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>> > Thanks Ted. I will echo these comments and recommend to run tests on
>> > larger and preferably "real" datasets rather than randomly generated
>> > ones. The more repetition and less entropy in a dataset, the better
>> > Parquet performs relative to other storage options. Web-scale datasets
>> > often exhibit these characteristics.
>> >
>> > If you can publish your benchmarking code that would also be helpful!
>> >
>> > best
>> > Wes
>> >
>> > On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> > > Simba
>> > >
>> > > Nice summary. I think that there may be some issues with your tests.
>> In
>> > > particular, you are storing essentially uniform random values. That
>> might
>> > > be a viable test in some situations, there are many where there is
>> > > considerably less entropy in the data being stored. For instance, if
>> you
>> > > store measurements, it is very typical to have very strong
>> correlations.
>> > > Likewise if the rows are, say, the time evolution of an optimization.
>> You
>> > > also have a very small number of rows which can penalize system that
>> > expect
>> > > to amortize column meta data over more data.
>> > >
>> > > This test might match your situation, but I would be leery of drawing
>> > > overly broad conclusions from this single data point.
>> > >
>> > >
>> > >
>> > > On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com>
>> > wrote:
>> > >
>> > >> Hi Uwe, thanks.
>> > >>
>> > >> I've attached a Google Sheet link
>> > >>
>> > >> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-25 Thread simba nyatsanga
Thanks all for the great feedback!

Thanks Daniel for the sample data sets. I loaded them up and they're quite
comparable in size to some of the data I'm dealing with. In my case the
shapes range from 150 to ~100 million rows. Column-wise they range from 2-3
columns to ~500,000 columns.

Thanks Wes for the insight regarding the inverse proportion between entropy
and Parquet's performance. I'm glad I now understand why my benchmarking set
would have skewed the results. The data sets I'm dealing with will be
fairly random, but have very low cardinality. In that benchmark I used
values that range from 1 to 9. So if I understand you correctly,
repetitiveness is key for Parquet's performance as opposed to cardinality
(even though the lower the cardinality, the more likely I am to have
repeated values because of the small number of possibilities).

Thanks Ted for the insight as well. Can I get some clarification on what you
said: *You also have a very small number of rows which can penalize the
system that expects to amortize column meta data over more data.* If I
understand you correctly, are you saying there's a column metadata overhead,
and this overhead is amortized or "paid off" when I have a large amount of
data? If that's the case, is the said amortization also applicable in the
case where I used 1 million rows?
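
For reference, the per-column footer overhead Ted refers to can be inspected
directly; a rough sketch, run against one of the benchmark files (the path is
a placeholder):

import pyarrow.parquet as pq

meta = pq.ParquetFile("bench_snappy.parquet").metadata
print("row groups: %d, columns: %d" % (meta.num_row_groups, meta.num_columns))

# Per-column-chunk bookkeeping in the first row group; this is the metadata
# that gets amortized over the data pages as the file grows.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print("%s: %d compressed, %d uncompressed" %
          (col.path_in_schema, col.total_compressed_size,
           col.total_uncompressed_size))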

Kind Regards
Simba

On Wed, 24 Jan 2018 at 21:30 Daniel Lemire <lem...@gmail.com> wrote:

> Here are some realistic tabular data sets...
>
> https://github.com/lemire/RealisticTabularDataSets
>
> They are small by modern standards but they are also one GitHub clone away.
>
> - Daniel
>
> On Wed, Jan 24, 2018 at 2:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Thanks Ted. I will echo these comments and recommend to run tests on
> > larger and preferably "real" datasets rather than randomly generated
> > ones. The more repetition and less entropy in a dataset, the better
> > Parquet performs relative to other storage options. Web-scale datasets
> > often exhibit these characteristics.
> >
> > If you can publish your benchmarking code that would also be helpful!
> >
> > best
> > Wes
> >
> > On Wed, Jan 24, 2018 at 1:21 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> > > Simba
> > >
> > > Nice summary. I think that there may be some issues with your tests. In
> > > particular, you are storing essentially uniform random values. That
> might
> > > be a viable test in some situations, there are many where there is
> > > considerably less entropy in the data being stored. For instance, if
> you
> > > store measurements, it is very typical to have very strong
> correlations.
> > > Likewise if the rows are, say, the time evolution of an optimization.
> You
> > > also have a very small number of rows which can penalize system that
> > expect
> > > to amortize column meta data over more data.
> > >
> > > This test might match your situation, but I would be leery of drawing
> > > overly broad conclusions from this single data point.
> > >
> > >
> > >
> > > On Jan 24, 2018 5:44 AM, "simba nyatsanga" <simnyatsa...@gmail.com>
> > wrote:
> > >
> > >> Hi Uwe, thanks.
> > >>
> > >> I've attached a Google Sheet link
> > >>
> > >> https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-
> > >> SoFYrRcfi1siYKFQ/edit#gid=0
> > >>
> > >> Kind Regards
> > >> Simba
> > >>
> > >> On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:
> > >>
> > >> > Hello Simba,
> > >> >
> > >> > your plots did not come through. Try uploading them somewhere and
> link
> > >> > to them in the mails. Attachments are always stripped on Apache
> > >> > mailing lists.
> > >> > Uwe
> > >> >
> > >> >
> > >> > On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > >> > > Hi Everyone,
> > >> > >
> > >> > > I did some benchmarking to compare the disk size performance when
> > >> > > writing Pandas DataFrames to parquet files using Snappy and Brotli
> > >> > > compression. I then compared these numbers with those of my
> current
> > >> > > file storage solution.>
> > >> > > In my current (non Arrow+Parquet solution), every column in a
> > >> > > DataFrame is extracted as NumPy array then compressed with blosc
> and
> > >&g

Re: [Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Uwe, thanks.

I've attached a Google Sheet link

https://docs.google.com/spreadsheets/d/1by1vCaO2p24PLq_NAA5Ckh1n3i-SoFYrRcfi1siYKFQ/edit#gid=0

Kind Regards
Simba

On Wed, 24 Jan 2018 at 15:07 Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Simba,
>
> your plots did not come through. Try uploading them somewhere and link
> to them in the mails. Attachments are always stripped on Apache
> mailing lists.
> Uwe
>
>
> On Wed, Jan 24, 2018, at 1:48 PM, simba nyatsanga wrote:
> > Hi Everyone,
> >
> > I did some benchmarking to compare the disk size performance when
> > writing Pandas DataFrames to parquet files using Snappy and Brotli
> > compression. I then compared these numbers with those of my current
> > file storage solution.
> >
> > In my current (non Arrow+Parquet solution), every column in a
> > DataFrame is extracted as NumPy array then compressed with blosc and
> > stored as a binary file. Additionally there's a small accompanying
> > json file with some metadata. Attached are my results for several long
> > and wide DataFrames:
> >
> > Screen Shot 2018-01-24 at 14.40.48.png
> >
> > I was also able to correlate this finding by looking at the number of
> > allocated blocks:
> >
> > Screen Shot 2018-01-24 at 14.45.29.png
> >
> > From what I gather Brotli and Snappy perform significantly better for
> > wide DataFrames. However the reverse is true for long DataFrames.
> >
> > The DataFrames used in the benchmark are entirely composed of floats
> > and my understanding is that there's type specific encoding employed
> > on the parquet file. Additionally the compression codecs are applied
> > to individual segments of the parquet file.
> >
> > I'd like to get a better understanding of this disk size disparity
> > specifically if there are any additional encoding/compression headers
> > added to the parquet file in the long DataFrames case.
> >
> > Kind Regards
> > Simba
>
>


[Python] Disk size performance of Snappy vs Brotli vs Blosc

2018-01-24 Thread simba nyatsanga
Hi Everyone,

I did some benchmarking to compare the disk size performance when writing
Pandas DataFrames to parquet files using Snappy and Brotli compression. I
then compared these numbers with those of my current file storage solution.

In my current (non Arrow+Parquet) solution, every column in a DataFrame is
extracted as a NumPy array and then compressed with blosc and stored as a
binary file. Additionally, there's a small accompanying JSON file with some
metadata. Attached are my results for several long and wide DataFrames:

[image: Screen Shot 2018-01-24 at 14.40.48.png]

I was also able to correlate this finding by looking at the number of
allocated blocks:

[image: Screen Shot 2018-01-24 at 14.45.29.png]

From what I gather, Brotli and Snappy perform significantly better for wide
DataFrames. However, the reverse is true for long DataFrames.

The DataFrames used in the benchmark are entirely composed of floats, and my
understanding is that there's type-specific encoding employed on the parquet
file. Additionally, the compression codecs are applied to individual
segments of the parquet file.

I'd like to get a better understanding of this disk size disparity,
specifically whether there are any additional encoding/compression headers
added to the parquet file in the long DataFrames case.
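
For reference, the current blosc-per-column path being compared against is
roughly the following (a simplified sketch, assuming the python-blosc
package; the exact metadata layout and file naming here are made up):

import json
import blosc
import numpy as np
import pandas as pd

def write_blosc_columns(df, prefix):
    # Compress each column with blosc and store it as its own binary file,
    # plus a small JSON file with some metadata.
    meta = {"columns": list(df.columns), "rows": len(df)}
    for name in df.columns:
        values = np.ascontiguousarray(df[name].values)
        payload = blosc.compress(values.tobytes(), typesize=values.dtype.itemsize)
        with open("%s_%s.blosc" % (prefix, name), "wb") as f:
            f.write(payload)
    with open("%s.json" % prefix, "w") as f:
        json.dump(meta, f)

write_blosc_columns(pd.DataFrame({"a": np.random.rand(1000000)}), "example")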

Kind Regards
Simba


Re: Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-22 Thread simba nyatsanga
Great! Thanks Wes. It's really interesting to see a concerted effort to
convert language-specific implementations of common data structures (HashMap
in Java, Hash in Ruby, etc.) into a common memory layout that can be consumed
by another language. Excited to see how the API evolves in this regard.
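
For anyone landing on this thread later, the explicit typing discussed above
looks roughly like this (a sketch; the field names are made up, and
dict-to-struct conversion support depends on the pyarrow version):

import pyarrow as pa

# A column of dict-like rows, declared explicitly as a STRUCT type rather
# than falling back to dtype=object the way pandas/NumPy would.
ty = pa.struct([pa.field("symbol", pa.string()),
                pa.field("price", pa.float64())])

arr = pa.array([{"symbol": "AAPL", "price": 171.5},
                {"symbol": "MSFT", "price": 94.1}], type=ty)
print(arr.type)  # struct<symbol: string, price: double>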

On Mon, 22 Jan 2018 at 23:54 Wes McKinney <wesmck...@gmail.com> wrote:

> Note we have https://issues.apache.org/jira/browse/ARROW-1705 (and
> maybe some other JIRAs, I'd have to go digging) about improving
> support for converting Python dicts to the right Arrow memory layout.
>
> - Wes
>
> On Mon, Jan 22, 2018 at 4:50 PM, simba nyatsanga <simnyatsa...@gmail.com>
> wrote:
> > Hi Uwe,
> >
> > Thank you very much for the detailed explanation. I have a much better
> > understanding now.
> >
> > Cheers
> >
> > On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> >> Hello Simba,
> >>
> >> find the answers inline.
> >>
> >> On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
> >> > Hi Everyone,
> >> >
> >> > I've got two questions that I'd like help with:
> >> >
> >> > 1. Pandas and numpy arrays can handle multiple types in a sequence
> eg. a
> >> > float and a string by using the dtype=object. From what I gather,
> Arrow
> >> > arrays enforce a uniform type depending on the type of the first
> >> > encountered element in a sequence. This looks like a deliberate choice
> >> and
> >> > I'd like to get a better understanding of the reason for ensuring this
> >> > conformity. Does making the data structure's type deterministic allow
> for
> >> > efficient pointer arithmetic when reading contiguous blocks and thus
> >> making
> >> > reading performant?
> >>
> >> As NumPy arrays, Arrow arrays are statically typed. In the case of NumPy
> >> you simply have the limitation that the type system can only represent a
> >> small number of types. Especially all these types are primitive and
> allow
> >> no nesting (e.g. you cannot implement a NumPy array of NumPy arrays of
> >> varying lengths). In NumPy you have the way to work around this
> limitation
> >> by using the object type. This simply means you have any array of
> (64bit)
> >> pointers to Python objects of which NumPy does know nothing. In the most
> >> simplistic form, you could achieve the same behaviour by allocating an
> >> INT64 Arrow Array, increase the reference count of each object and then
> >> store the pointers of the object in this array. While this may work,
> please
> >> don't use this kind of hack.
> >>
> >> The main concept of Arrow is to define data structures that can be
> >> exchanged between applications that are implemented in different
> languages
> >> and ecosystems. Storing Python objects in them is a bit against its use
> >> case (we might support this one day for convenience in Python but it
> will
> >> be discouraged). In Arrow we have the concept of a UNION type, i.e. we
> can
> >> specify that a row can contain an object of a fixed set of types. This
> will
> >> bring you nearly the same abilities you have with the object type but
> with
> >> the improvement that you could also pass this data to another Arrow
> >> consumer of any language and it can cope with the data. But this also
> comes
> >> a bit at the cost of usability: You need to specify the types that
> occur in
> >> the array (this one is also an "at least for", we may write some
> >> auto-detection in the future but this is a bit of work).
> >>
> >> > 2. Pandas and numpy can also handle dictionary elements using the
> >> > dtype=object while pyarrow arrays don't. I'd like to understand the
> >> > reasoning behind the choice here as well.
> >>
> >> This is again due to being more statically typed than just supporting
> >> pointers to generic objects. For this we actually have at the moment a
> >> STRUCT type in Arrow that supports in each row we have a set of named
> >> entries where each entry has a fixed type (but the types can be
> different
> >> between entries). Alternatively we also have a MAP<KEY, VALUE> type
> (that
> >> probably needs some more specification work). Here you store data as
> you do
> >> in a typical Python dictionary but KEY and VALUE are fixed types.
> Depending
> >> on your data either STRUCT or MAP might be the correct types to use.
> >>
> >> As we talk in general about columnar data in the Arrow context, we
> expect
> >> that the data in a column is of the same or a similar type in each row
> of a
> >> column.
> >>
> >> Uwe
> >>
>


Re: Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-22 Thread simba nyatsanga
Hi Uwe,

Thank you very much for the detailed explanation. I have a much better
understanding now.

Cheers

On Mon, 22 Jan 2018 at 19:37 Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Simba,
>
> find the answers inline.
>
> On Mon, Jan 22, 2018, at 7:29 AM, simba nyatsanga wrote:
> > Hi Everyone,
> >
> > I've got two questions that I'd like help with:
> >
> > 1. Pandas and numpy arrays can handle multiple types in a sequence eg. a
> > float and a string by using the dtype=object. From what I gather, Arrow
> > arrays enforce a uniform type depending on the type of the first
> > encountered element in a sequence. This looks like a deliberate choice
> and
> > I'd like to get a better understanding of the reason for ensuring this
> > conformity. Does making the data structure's type deterministic allow for
> > efficient pointer arithmetic when reading contiguous blocks and thus
> making
> > reading performant?
>
> As NumPy arrays, Arrow arrays are statically typed. In the case of NumPy
> you simply have the limitation that the type system can only represent a
> small number of types. Especially all these types are primitive and allow
> no nesting (e.g. you cannot implement a NumPy array of NumPy arrays of
> varying lengths). In NumPy you have the way to work around this limitation
> by using the object type. This simply means you have any array of (64bit)
> pointers to Python objects of which NumPy does know nothing. In the most
> simplistic form, you could achieve the same behaviour by allocating an
> INT64 Arrow Array, increase the reference count of each object and then
> store the pointers of the object in this array. While this may work, please
> don't use this kind of hack.
>
> The main concept of Arrow is to define data structures that can be
> exchanged between applications that are implemented in different languages
> and ecosystems. Storing Python objects in them is a bit against its use
> case (we might support this one day for convenience in Python but it will
> be discouraged). In Arrow we have the concept of a UNION type, i.e. we can
> specify that a row can contain an object of a fixed set of types. This will
> bring you nearly the same abilities you have with the object type but with
> the improvement that you could also pass this data to another Arrow
> consumer of any language and it can cope with the data. But this also comes
> a bit at the cost of usability: You need to specify the types that occur in
> the array (this one is also an "at least for", we may write some
> auto-detection in the future but this is a bit of work).
>
> > 2. Pandas and numpy can also handle dictionary elements using the
> > dtype=object while pyarrow arrays don't. I'd like to understand the
> > reasoning behind the choice here as well.
>
> This is again due to being more statically typed than just supporting
> pointers to generic objects. For this we actually have at the moment a
> STRUCT type in Arrow that supports in each row we have a set of named
> entries where each entry has a fixed type (but the types can be different
> between entries). Alternatively we also have a MAP<KEY, VALUE> type (that
> probably needs some more specification work). Here you store data as you do
> in a typical Python dictionary but KEY and VALUE are fixed types. Depending
> on your data either STRUCT or MAP might be the correct types to use.
>
> As we talk in general about columnar data in the Arrow context, we expect
> that the data in a column is of the same or a similar type in each row of a
> column.
>
> Uwe
>


Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-21 Thread simba nyatsanga
Hi Everyone,

I've got two questions that I'd like help with:

1. Pandas and numpy arrays can handle multiple types in a sequence, e.g. a
float and a string, by using dtype=object. From what I gather, Arrow
arrays enforce a uniform type depending on the type of the first
encountered element in a sequence. This looks like a deliberate choice and
I'd like to get a better understanding of the reason for ensuring this
conformity. Does making the data structure's type deterministic allow for
efficient pointer arithmetic when reading contiguous blocks and thus making
reading performant?

2. Pandas and numpy can also handle dictionary elements using the
dtype=object while pyarrow arrays don't. I'd like to understand the
reasoning behind the choice here as well.

Thanks again for taking my questions.

Kind Regards
Simba


Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Great, thank you for the explanation - it makes so much sense. I have a use
case where once I've converted an Arrow table back to pandas I then convert
it into a dictionary (with to_dict()). This dictionary then gets JSON
serialised and sent over the wire for display on the client side. I
encountered the behaviour when the JSON serialisation was failing for an
ndarray.

I think in addition to the performance/efficiency considerations you
mentioned, there isn't a strong need for the list option (at least for me).
I will handle such data types at the application level.
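
For completeness, the application-level handling amounts to something like
this (a sketch: a JSON encoder that falls back to .tolist() for ndarrays):

import json
import numpy as np

class NdarrayEncoder(json.JSONEncoder):
    # Serialise numpy arrays as plain lists so json.dumps doesn't choke on
    # the ndarrays coming out of to_pandas()/to_dict().
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

payload = {"values": np.array([1.0, 2.0, 3.0])}
print(json.dumps(payload, cls=NdarrayEncoder))  # {"values": [1.0, 2.0, 3.0]}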

Thanks.

On Thu, 18 Jan 2018 at 23:01 Wes McKinney <wesmck...@gmail.com> wrote:

> Upon converting to Arrow, the information about whether the original
> input was a list or ndarray was lost. So any kind of sequence ends up
> as an Arrow List type.
>
> When converting back to pandas, we could return either a list or an
> ndarray. Returning ndarray is faster and much more memory efficient;
> producing lists would require creating a lot of Python objects.
>
> Hypothetically, we could add an option to return lists instead of
> ndarrays if there were a strong enough need.
>
> - Wes
>
> On Thu, Jan 18, 2018 at 2:10 PM, simba nyatsanga <simnyatsa...@gmail.com>
> wrote:
> > Hi Wes,
> >
> > Great! Thanks for the pointer. From what I gather this is a fundamental
> and
> > deliberate design decision. Would I be correct in saying the memory
> > footprint and access speed of a NumPy array compared to that of a Python
> > list is the reason why the conversion is done?
> >
> > Kind Regards
> > Simba
> >
> > On Thu, 18 Jan 2018 at 20:35 Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> hi Simba,
> >>
> >> Yes -- Arrow list types are converted to NumPy arrays when converting
> >> back to pandas with to_pandas(...). This conversion happens in C++ code
> in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L541
> >>
> >> - Wes
> >>
> >> On Thu, Jan 18, 2018 at 1:26 PM, simba nyatsanga <
> simnyatsa...@gmail.com>
> >> wrote:
> >>
> >> > Good day everyone,
> >> >
> >> > I noticed what looks like type inference happening after persisting a
> >> > pandas DataFrame where one of the column values is a list. When I
> load up
> >> > the DataFrame again and do df.to_dict(), the value is no longer a list
> >> but
> >> > a numpy array. I dug through functions in the pandas_compat.py to try
> and
> >> > figure out at what point the dtype is being applied for that value.
> >> >
> >> > I'd like to verify if this is the intended behaviour.
> >> >
> >> > Here's an illustration of the behaviour:
> >> >
> >> > [image: Screen Shot 2018-01-18 at 15.54.59.png]
> >> >
> >> > Kind Regards
> >> > Simba
> >> >
> >>
>


Re: PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Hi Wes,

Great! Thanks for the pointer. From what I gather this is a fundamental and
deliberate design decision. Would I be correct in saying the memory
footprint and access speed of a NumPy array compared to that of a Python
list is the reason why the conversion is done?

Kind Regards
Simba

On Thu, 18 Jan 2018 at 20:35 Wes McKinney <wesmck...@gmail.com> wrote:

> hi Simba,
>
> Yes -- Arrow list types are converted to NumPy arrays when converting
> back to pandas with to_pandas(...). This conversion happens in C++ code in
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L541
>
> - Wes
>
> On Thu, Jan 18, 2018 at 1:26 PM, simba nyatsanga <simnyatsa...@gmail.com>
> wrote:
>
> > Good day everyone,
> >
> > I noticed what looks like type inference happening after persisting a
> > pandas DataFrame where one of the column values is a list. When I load up
> > the DataFrame again and do df.to_dict(), the value is no longer a list
> but
> > a numpy array. I dug through functions in the pandas_compat.py to try and
> > figure out at what point the dtype is being applied for that value.
> >
> > I'd like to verify if this is the intended behaviour.
> >
> > Here's an illustration of the behaviour:
> >
> > [image: Screen Shot 2018-01-18 at 15.54.59.png]
> >
> > Kind Regards
> > Simba
> >
>


PyArrow python list to numpy nd.array inference in pd.read_table

2018-01-18 Thread simba nyatsanga
Good day everyone,

I noticed what looks like type inference happening after persisting a
pandas DataFrame where one of the column values is a list. When I load up
the DataFrame again and do df.to_dict(), the value is no longer a list but
a numpy array. I dug through functions in the pandas_compat.py to try and
figure out at what point the dtype is being applied for that value.

I'd like to verify if this is the intended behaviour.

Here's an illustration of the behaviour:

[image: Screen Shot 2018-01-18 at 15.54.59.png]
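
In code form, the behaviour looks like this (a minimal sketch standing in for
the stripped screenshot; the column name and file name are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"values": [[1, 2, 3], [4, 5]]})
print(type(df.to_dict()["values"][0]))  # list

pq.write_table(pa.Table.from_pandas(df), "lists.parquet")
roundtripped = pq.read_table("lists.parquet").to_pandas()
print(type(roundtripped.to_dict()["values"][0]))  # numpy.ndarray after the round trip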

Kind Regards
Simba


Re: Trying to build pyarrow for python 2.7

2018-01-17 Thread simba nyatsanga
Hi Wes,

Great, thanks for the information.

On Tue, 16 Jan 2018 at 20:19 Wes McKinney <wesmck...@gmail.com> wrote:

> hi Simba -- the PyPI / pip wheels will only be updated when there is a
> new release. We'll either make a 0.8.1 release or 0.9.0 sometime in
> February depending on how development is progressing.
>
> - Wes
>
> On Sun, Jan 14, 2018 at 9:19 AM, simba nyatsanga <simnyatsa...@gmail.com>
> wrote:
> > Thanks a lot. I see that there's a PR  that's been opened to resolve the
> > encoding issue - https://github.com/apache/arrow/pull/1476
> >
> > Do you think this PR (if merged ) will also roll out as part of version
> > 0.9.0, or I'll be able to pip install with the merge commit as soon as
> it's
> > merged?
> >
> > Kind Regards
> >
> > On Sun, 14 Jan 2018 at 15:50 Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> >> Nice to hear that it worked.
> >>
> >> Updating the docs should not be necessary, we should rather see that we
> >> soon get a 0.9.0 release out (but that will also take some more weeks)
> >>
> >> Uwe
> >>
> >> On Sun, Jan 14, 2018, at 2:42 PM, simba nyatsanga wrote:
> >> > Amazing, thanks Uwe!
> >> >
> >> > I was able to build pyarrow successfully for python 2.7 using your
> >> > workaround. I appreciate that you've got a possible solution for the
> too.
> >> >
> >> > Besides the PR getting reviewed by more experienced maintainers, I'm
> >> > thinking to pull your branch and try the building process from
> scratch.
> >> > Otherwise I was wondering if it's valuable, in the meantime, to update
> >> the
> >> > docs with your work around?
> >> >
> >> > Kind Regards
> >> > Simba
> >> >
> >> > On Sun, 14 Jan 2018 at 15:17 Uwe L. Korn <uw...@xhochy.com> wrote:
> >> >
> >> > > Hello Simba,
> >> > >
> >> > > it looks like you are running to
> >> > > https://issues.apache.org/jira/browse/ARROW-1856.
> >> > >
> >> > > To work around this issue, please "unset PARQUET_HOME" before you
> call
> >> the
> >> > > setup.py. Also set PKG_CONFIG_PATH, in your case this should be
> "export
> >> > >
> PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig".
> >> By
> >> > > doing this, you do the package discovery using pkg-config instead of
> >> the
> >> > > *_HOME variables. Currently this is the only path on which we can
> >> > > auto-detect the extension of the parquet shared library.
> >> > >
> >> > > Nevertheless, I will take a shot at fixing the issues as it seems
> that
> >> > > multiple users run into it.
> >> > >
> >> > > Uwe
> >> > >
> >> > > On Thu, Jan 11, 2018, at 11:42 PM, simba nyatsanga wrote:
> >> > > > Hi Wes,
> >> > > >
> >> > > > Apologies for the ambiguity there. To clarify, I used the conda
> >> > > > instructions only to create a conda environment. So I did this
> >> > > >
> >> > > > conda create -y -q -n pyarrow-dev \
> >> > > >   python=2.7 numpy six setuptools cython pandas pytest \
> >> > > >   cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy
> zlib \
> >> > > >   gflags brotli jemalloc lz4-c zstd -c conda-forge
> >> > > >
> >> > > >
> >> > > > I followed the instructions closely and I've stumbled upon a
> >> different
> >> > > > error from the one I initially had encountered. Now the issue
> seems
> >> to be
> >> > > > that when I'm building the Arrow C++ i.e running the following
> steps:
> >> > > >
> >> > > > mkdir parquet-cpp/build
> >> > > > pushd parquet-cpp/build
> >> > > >
> >> > > > cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
> >> > > >   -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
> >> > > >   -DPARQUET_BUILD_BENCHMARKS=off \
> >> > > >   -DPARQUET_BUILD_EXECUTABLES=off \
> >> > > >   -DPARQUET_BUILD_TESTS=off \
> >> > > >   ..
> >> > > >
> >> > > > make -j4
> >> > > > make install
> >> > > > popd
> 

Re: Trying to build pyarrow for python 2.7

2018-01-14 Thread simba nyatsanga
Thanks a lot. I see that there's a PR that's been opened to resolve the
encoding issue - https://github.com/apache/arrow/pull/1476

Do you think this PR (if merged) will roll out as part of version 0.9.0, or
will I be able to pip install from the merge commit as soon as it's merged?

Kind Regards

On Sun, 14 Jan 2018 at 15:50 Uwe L. Korn <uw...@xhochy.com> wrote:

> Nice to hear that it worked.
>
> Updating the docs should not be necessary, we should rather see that we
> soon get a 0.9.0 release out (but that will also take some more weeks)
>
> Uwe
>
> On Sun, Jan 14, 2018, at 2:42 PM, simba nyatsanga wrote:
> > Amazing, thanks Uwe!
> >
> > I was able to build pyarrow successfully for python 2.7 using your
> > workaround. I appreciate that you've got a possible solution for the too.
> >
> > Besides the PR getting reviewed by more experienced maintainers, I'm
> > thinking to pull your branch and try the building process from scratch.
> > Otherwise I was wondering if it's valuable, in the meantime, to update
> the
> > docs with your work around?
> >
> > Kind Regards
> > Simba
> >
> > On Sun, 14 Jan 2018 at 15:17 Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Simba,
> > >
> > > it looks like you are running to
> > > https://issues.apache.org/jira/browse/ARROW-1856.
> > >
> > > To work around this issue, please "unset PARQUET_HOME" before you call
> the
> > > setup.py. Also set PKG_CONFIG_PATH, in your case this should be "export
> > > PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig".
> By
> > > doing this, you do the package discovery using pkg-config instead of
> the
> > > *_HOME variables. Currently this is the only path on which we can
> > > auto-detect the extension of the parquet shared library.
> > >
> > > Nevertheless, I will take a shot at fixing the issues as it seems that
> > > multiple users run into it.
> > >
> > > Uwe
> > >
> > > On Thu, Jan 11, 2018, at 11:42 PM, simba nyatsanga wrote:
> > > > Hi Wes,
> > > >
> > > > Apologies for the ambiguity there. To clarify, I used the conda
> > > > instructions only to create a conda environment. So I did this
> > > >
> > > > conda create -y -q -n pyarrow-dev \
> > > >   python=2.7 numpy six setuptools cython pandas pytest \
> > > >   cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
> > > >   gflags brotli jemalloc lz4-c zstd -c conda-forge
> > > >
> > > >
> > > > I followed the instructions closely and I've stumbled upon a
> different
> > > > error from the one I initially had encountered. Now the issue seems
> to be
> > > > that when I'm building the Arrow C++ i.e running the following steps:
> > > >
> > > > mkdir parquet-cpp/build
> > > > pushd parquet-cpp/build
> > > >
> > > > cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
> > > >   -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
> > > >   -DPARQUET_BUILD_BENCHMARKS=off \
> > > >   -DPARQUET_BUILD_EXECUTABLES=off \
> > > >   -DPARQUET_BUILD_TESTS=off \
> > > >   ..
> > > >
> > > > make -j4
> > > > make install
> > > > popd
> > > >
> > > >
> > > > The make install step generates *libparquet.1.3.2.dylib* as one of
> the
> > > > artefacts, as illustrated below:
> > > >
> > > > -- Install configuration: "RELEASE"-- Installing:
> > > >
> /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-
> > > > cppConfig.cmake--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/
> > > > cmake/parquet-cppConfigVersion.cmake--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.
> > > > 1.3.2.dylib--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.
> > > > 1.dylib--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/
> > > > libparquet.dylib--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.a--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > > > column_reader.h--
> > > > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > > > column_page.h--
> > > > 

Re: Trying to build pyarrow for python 2.7

2018-01-14 Thread simba nyatsanga
Amazing, thanks Uwe!

I was able to build pyarrow successfully for python 2.7 using your
workaround. I appreciate that you've got a possible solution for that too.

Besides the PR getting reviewed by more experienced maintainers, I'm
thinking of pulling your branch and trying the build process from scratch.
Otherwise, I was wondering if it's valuable, in the meantime, to update the
docs with your workaround?

Kind Regards
Simba

On Sun, 14 Jan 2018 at 15:17 Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Simba,
>
> it looks like you are running to
> https://issues.apache.org/jira/browse/ARROW-1856.
>
> To work around this issue, please "unset PARQUET_HOME" before you call the
> setup.py. Also set PKG_CONFIG_PATH, in your case this should be "export
> PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig". By
> doing this, you do the package discovery using pkg-config instead of the
> *_HOME variables. Currently this is the only path on which we can
> auto-detect the extension of the parquet shared library.
>
> Nevertheless, I will take a shot at fixing the issues as it seems that
> multiple users run into it.
>
> Uwe
>
> On Thu, Jan 11, 2018, at 11:42 PM, simba nyatsanga wrote:
> > Hi Wes,
> >
> > Apologies for the ambiguity there. To clarify, I used the conda
> > instructions only to create a conda environment. So I did this
> >
> > conda create -y -q -n pyarrow-dev \
> >   python=2.7 numpy six setuptools cython pandas pytest \
> >   cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
> >   gflags brotli jemalloc lz4-c zstd -c conda-forge
> >
> >
> > I followed the instructions closely and I've stumbled upon a different
> > error from the one I initially had encountered. Now the issue seems to be
> > that when I'm building the Arrow C++ i.e running the following steps:
> >
> > mkdir parquet-cpp/build
> > pushd parquet-cpp/build
> >
> > cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
> >   -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
> >   -DPARQUET_BUILD_BENCHMARKS=off \
> >   -DPARQUET_BUILD_EXECUTABLES=off \
> >   -DPARQUET_BUILD_TESTS=off \
> >   ..
> >
> > make -j4
> > make install
> > popd
> >
> >
> > The make install step generates *libparquet.1.3.2.dylib* as one of the
> > artefacts, as illustrated below:
> >
> > -- Install configuration: "RELEASE"-- Installing:
> > /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-
> > cppConfig.cmake--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/
> > cmake/parquet-cppConfigVersion.cmake--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.
> > 1.3.2.dylib--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.
> > 1.dylib--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/
> > libparquet.dylib--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.a--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > column_reader.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > column_page.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > column_scanner.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > column_writer.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > encoding.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > exception.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > file_reader.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > file_writer.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > metadata.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > printer.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > properties.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > schema.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > statistics.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > types.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/
> > parquet_version.h--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig/
> > parquet.pc--
> > Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/api/
&g

Re: Trying to build pyarrow for python 2.7

2018-01-11 Thread simba nyatsanga
g 3.8.0svn
Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Build output directory: /Users/simba/Projects/personal/oss/arrow/python/build/temp.macosx-10.9-x86_64-2.7/release/
-- Checking for module 'arrow'
--   Found arrow, version 0.9.0-SNAPSHOT
-- Arrow ABI version: 0.0.0
-- Arrow SO version: 0
-- Found the Arrow core library: /Users/simba/anaconda/envs/pyarrow-dev/lib/libarrow.dylib
-- Found the Arrow Python library: /Users/simba/anaconda/envs/pyarrow-dev/lib/libarrow_python.dylib
Added shared library dependency arrow: /Users/simba/anaconda/envs/pyarrow-dev/lib/libarrow.dylib
Added shared library dependency arrow_python: /Users/simba/anaconda/envs/pyarrow-dev/lib/libarrow_python.dylib
-- Found the Parquet library: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.dylib
CMake Error: File /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.1.0.0.dylib does not exist.
CMake Error at CMakeLists.txt:213 (configure_file):
  configure_file Problem configuring file
Call Stack (most recent call first):
  CMakeLists.txt:296 (bundle_arrow_lib)

Added shared library dependency parquet: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.dylib
-- Checking for module 'plasma'
--   Found plasma, version
-- Plasma ABI version: 0.0.0
-- Plasma SO version: 0
-- Found the Plasma core library: /Users/simba/anaconda/envs/pyarrow-dev/lib/libplasma.dylib
-- Found Plasma executable: /Users/simba/anaconda/envs/pyarrow-dev/bin/plasma_store
Added shared library dependency libplasma: /Users/simba/anaconda/envs/pyarrow-dev/lib/libplasma.dylib
-- Configuring incomplete, errors occurred!
See also "/Users/simba/Projects/personal/oss/arrow/python/build/temp.macosx-10.9-x86_64-2.7/CMakeFiles/CMakeOutput.log".
See also "/Users/simba/Projects/personal/oss/arrow/python/build/temp.macosx-10.9-x86_64-2.7/CMakeFiles/CMakeError.log".
error: command 'cmake' failed with exit status 1


Also (perhaps) worth noting from the above: I'm picking up *arrow
0.9.0-SNAPSHOT*.

From what I can see in the */Users/simba/anaconda/envs/pyarrow-dev/lib*
folder, the symlink is in fact pointing to *libparquet.1.3.2.dylib* instead
of the expected *libparquet.1.0.0.dylib*:

> pwd
/Users/simba/anaconda/envs/pyarrow-dev/lib
> ll | grep "libparquet"
-rwxr-xr-x  1 simba  staff  1.6M Jan 11 18:45 libparquet.1.3.2.dylib
lrwxr-xr-x  1 simba  staff   22B Jan 11 18:45 libparquet.1.dylib -> libparquet.1.3.2.dylib
-rw-r--r--  1 simba  staff  3.0M Jan 11 18:45 libparquet.a
lrwxr-xr-x  1 simba  staff   18B Jan 11 18:45 libparquet.dylib -> libparquet.1.dylib



Just to clarify also, I'm attempting to build the wheel from within the
*arrow/python* folder where the *setup.py* file is.

Thanks again for the help.

Simba



On Thu, 11 Jan 2018 at 09:09 simba nyatsanga <simnyatsa...@gmail.com> wrote:

> Hi Wes,
>
> Thanks for the response. I was following the development instructions on
> Github here:
> https://github.com/apache/arrow/blob/master/python/doc/source/development.rst
>
> I took MacOS option and installed my virtual env via conda. I must've
> missed an instruction when trying the 2.7 install, because I was able to
> successfully install for 3.6.
>
> Although it looks like the instructions on Github are similar to the ones
> you linked, I will give it another go with the later.
>
> Kind Regards
> Simba
>
> On Thu, 11 Jan 2018 at 00:51 Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Simba,
>>
>> Are you following development instructions in
>>
>> http://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos
>> or something else?
>>
>> - Wes
>>
>> On Wed, Jan 10, 2018 at 11:20 AM, simba nyatsanga
>> <simnyatsa...@gmail.com> wrote:
>> > Hi,
>> >
>> > I've created a python 2.7 virtualenv in my attempt to build the pyarrow
>> > project. But I'm having trouble running one of commands as specified in
>> the
>> > development docs on Github, specifically this command:
>> >
>> > cd arrow/python
>> > python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
>> >--with-parquet --with-plasma --inplace
>> >
>> > The error output looks like this:
>> >
>> > running build_ext-- Runnning cmake for pyarrow
>> > cmake
>> -DPYTHON_EXECUTABLE=/Users/simba/anaconda/envs/pyarrow-dev-py2.7/bin/python
>> >  -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=on
>> > -DCMAKE_BUILD_TYPE= /Users/simba/Projects/personal/oss/arrow/python
>> > INFOCompiler command: /Library/Developer/CommandLineTools/usr/bin/c++
>> > INFOCompiler version: Apple LLVM version 8.0.0
>> > (clang-800.0.42.1)Target: x86_64-apple-darwin15.6.0
>> > Thr

Re: Trying to build pyarrow for python 2.7

2018-01-10 Thread simba nyatsanga
Hi Wes,

Thanks for the response. I was following the development instructions on
Github here:
https://github.com/apache/arrow/blob/master/python/doc/source/development.rst

I took the MacOS option and installed my virtual env via conda. I must've
missed an instruction when trying the 2.7 install, because I was able to
install successfully for 3.6.

Although it looks like the instructions on GitHub are similar to the ones
you linked, I will give it another go with the latter.

Kind Regards
Simba

On Thu, 11 Jan 2018 at 00:51 Wes McKinney <wesmck...@gmail.com> wrote:

> hi Simba,
>
> Are you following development instructions in
>
> http://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos
> or something else?
>
> - Wes
>
> On Wed, Jan 10, 2018 at 11:20 AM, simba nyatsanga
> <simnyatsa...@gmail.com> wrote:
> > Hi,
> >
> > I've created a python 2.7 virtualenv in my attempt to build the pyarrow
> > project. But I'm having trouble running one of commands as specified in
> the
> > development docs on Github, specifically this command:
> >
> > cd arrow/python
> > python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
> >--with-parquet --with-plasma --inplace
> >
> > The error output looks like this:
> >
> > running build_ext-- Runnning cmake for pyarrow
> > cmake
> -DPYTHON_EXECUTABLE=/Users/simba/anaconda/envs/pyarrow-dev-py2.7/bin/python
> >  -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=on
> > -DCMAKE_BUILD_TYPE= /Users/simba/Projects/personal/oss/arrow/python
> > INFOCompiler command: /Library/Developer/CommandLineTools/usr/bin/c++
> > INFOCompiler version: Apple LLVM version 8.0.0
> > (clang-800.0.42.1)Target: x86_64-apple-darwin15.6.0
> > Thread model: posixInstalledDir:
> /Library/Developer/CommandLineTools/usr/bin
> >
> > INFOCompiler id: Clang
> > Selected compiler clang 3.8.0svn
> > Configured for DEBUG build (set with cmake
> > -DCMAKE_BUILD_TYPE={release,debug,...})-- Build Type: DEBUG-- Build
> > output directory:
> > /Users/simba/Projects/personal/oss/arrow/python/build/debug/--
> > Checking for module 'arrow'--   No package 'arrow' found-- Found the
> > Arrow core library:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow.dylib--
> > Found the Arrow Python library:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow_python.dylib
> > Added shared library dependency arrow:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow.dylib
> > Added shared library dependency arrow_python:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow_python.dylib--
> > Checking for module 'parquet'--   No package 'parquet' found-- Found
> > the Parquet library:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libparquet.dylib
> > Added shared library dependency parquet:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libparquet.dylib--
> > Checking for module 'plasma'--   No package 'plasma' found-- Found the
> > Plasma core library:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libplasma.dylib--
> > Found Plasma executable:
> > Added shared library dependency libplasma:
> > /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libplasma.dylib--
> > Configuring done-- Generating done-- Build files have been written to:
> > /Users/simba/Projects/personal/oss/arrow/python-- Finished cmake for
> > pyarrow-- Running cmake --build for pyarrow
> > makemake: *** No targets specified and no makefile found.  Stop.error:
> > command 'make' failed with exit status 2
> >
> >
> > It looks like there's a change dir happening at this line in the
> setup.py:
> > https://github.com/apache/arrow/blob/master/python/setup.py#L136
> > Which, in my case, is switching to the temp build which doesn't have the
> > required Makefile to run the make command.
> >
> > I could be missing something because I was able to build the project
> > successfully for python3. But I'd like to build it in python2.7 to
> attempt
> > a bug fix for this issue:
> https://issues.apache.org/jira/browse/ARROW-1976
> >
> > Thanks for help.
> >
> > Kind Regards
> > Simba
>


Trying to build pyarrow for python 2.7

2018-01-10 Thread simba nyatsanga
Hi,

I've created a python 2.7 virtualenv in my attempt to build the pyarrow
project. But I'm having trouble running one of the commands specified in the
development docs on GitHub, specifically this command:

cd arrow/python
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
   --with-parquet --with-plasma --inplace

The error output looks like this:

running build_ext
-- Runnning cmake for pyarrow
cmake -DPYTHON_EXECUTABLE=/Users/simba/anaconda/envs/pyarrow-dev-py2.7/bin/python -DPYARROW_BUILD_PARQUET=on -DPYARROW_BUILD_PLASMA=on -DCMAKE_BUILD_TYPE= /Users/simba/Projects/personal/oss/arrow/python
INFOCompiler command: /Library/Developer/CommandLineTools/usr/bin/c++
INFOCompiler version: Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

INFOCompiler id: Clang
Selected compiler clang 3.8.0svn
Configured for DEBUG build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: DEBUG
-- Build output directory: /Users/simba/Projects/personal/oss/arrow/python/build/debug/
-- Checking for module 'arrow'
--   No package 'arrow' found
-- Found the Arrow core library: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow.dylib
-- Found the Arrow Python library: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow_python.dylib
Added shared library dependency arrow: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow.dylib
Added shared library dependency arrow_python: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libarrow_python.dylib
-- Checking for module 'parquet'
--   No package 'parquet' found
-- Found the Parquet library: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libparquet.dylib
Added shared library dependency parquet: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libparquet.dylib
-- Checking for module 'plasma'
--   No package 'plasma' found
-- Found the Plasma core library: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libplasma.dylib
-- Found Plasma executable:
Added shared library dependency libplasma: /Users/simba/anaconda/envs/pyarrow-dev-py2.7/lib/libplasma.dylib
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/simba/Projects/personal/oss/arrow/python
-- Finished cmake for pyarrow
-- Running cmake --build for pyarrow
make
make: *** No targets specified and no makefile found.  Stop.
error: command 'make' failed with exit status 2


It looks like there's a directory change happening at this line in setup.py:
https://github.com/apache/arrow/blob/master/python/setup.py#L136
which, in my case, switches to the temp build directory, which doesn't have
the Makefile required to run the make command.

I could be missing something because I was able to build the project
successfully for python3. But I'd like to build it in python2.7 to attempt
a bug fix for this issue: https://issues.apache.org/jira/browse/ARROW-1976

Thanks for the help.

Kind Regards
Simba