[jira] [Commented] (ARROW-780) PYTHON_EXECUTABLE Required to be set during build

2017-04-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968344#comment-15968344
 ] 

Phillip Cloud commented on ARROW-780:
-

Yes. [~wesmckinn] Can you show the output of your cmake when building 
arrow-cpp? I'd like to see what it should be doing for reference.
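
In the meantime, the usual workaround is to hand the environment's interpreter to CMake explicitly. A minimal sketch, assuming the desired interpreter is the one on {{PATH}} (the cmake line is illustrative and left commented out):

```shell
# Resolve the active environment's interpreter and pass it to CMake.
# The cmake invocation itself is illustrative, so it stays commented out.
PYTHON_EXECUTABLE="$(command -v python3 || command -v python || true)"
echo "would configure with: -DPYTHON_EXECUTABLE=${PYTHON_EXECUTABLE}"
# cmake .. -DPYTHON_EXECUTABLE="${PYTHON_EXECUTABLE}"
```

With a conda environment activated, `command -v python` resolves to that environment's interpreter, which is what setting PYTHON_EXECUTABLE by hand accomplishes.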

> PYTHON_EXECUTABLE Required to be set during build
> -
>
> Key: ARROW-780
> URL: https://issues.apache.org/jira/browse/ARROW-780
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: build
>
> I had to set PYTHON_EXECUTABLE to my conda environment's Python interpreter. 
> [~wesm_impala_7e40] says he doesn't have to. We should clarify whether this 
> is in fact necessary.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-780) PYTHON_EXECUTABLE Required to be set during build

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968330#comment-15968330
 ] 

Wes McKinney commented on ARROW-780:


Is this still an issue?

> PYTHON_EXECUTABLE Required to be set during build
> -
>
> Key: ARROW-780
> URL: https://issues.apache.org/jira/browse/ARROW-780
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: build
>
> I had to set PYTHON_EXECUTABLE to my conda environment's Python interpreter. 
> [~wesm_impala_7e40] says he doesn't have to. We should clarify whether this 
> is in fact necessary.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array

2017-04-13 Thread Itai Incze (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968317#comment-15968317
 ] 

Itai Incze commented on ARROW-809:
--

I wrote the comment before seeing your latest one, so it wasn't meant to cast 
doubt on the solution. 
I've seen that code... though I'm certain you're much better acquainted with 
it than I am :)


> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
> Key: ARROW-809
> URL: https://issues.apache.org/jira/browse/ARROW-809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Itai Incze
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.3.0
>
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>   w = pyarrow.ipc.FileWriter(f, rb.schema)
>   w.write_batch(rb2)
>   w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow 
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-820) [C++] Build dependencies for Parquet library without arrow support

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968313#comment-15968313
 ] 

Wes McKinney commented on ARROW-820:


Is leaving off the {{arrow/ipc}} directory and its thirdparty dependencies 
(flatbuffers, rapidjson) sufficient?
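
A possible shape for such a minimal build, sketched with hypothetical option names ({{ARROW_IPC}} in particular is illustrative here, not an existing flag):

```shell
# Hypothetical minimal configure line for Parquet-only consumers of libarrow.
# The option names below are illustrative, not existing CMake flags.
MINIMAL_FLAGS="-DARROW_IPC=OFF -DARROW_BUILD_TESTS=OFF"
echo "cmake .. ${MINIMAL_FLAGS}"
```

Turning the IPC layer into an opt-out component is what would let the flatbuffers and rapidjson downloads disappear from a Parquet-only build.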

> [C++] Build dependencies for Parquet library without arrow support
> --
>
> Key: ARROW-820
> URL: https://issues.apache.org/jira/browse/ARROW-820
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>
> The Parquet C++ library without Arrow support depends only on a subset of Arrow 
> components (buffers, IO). The scope of this JIRA is to build libarrow with 
> minimal dependencies for users of the Parquet C++ library without Arrow support.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968310#comment-15968310
 ] 

Wes McKinney commented on ARROW-809:


There is some buffer slicing happening on the IPC write path already: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L207. 
It needs to be made consistent (and well tested), though.

> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
> Key: ARROW-809
> URL: https://issues.apache.org/jira/browse/ARROW-809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Itai Incze
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.3.0
>
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>   w = pyarrow.ipc.FileWriter(f, rb.schema)
>   w.write_batch(rb2)
>   w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow 
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-820) [C++] Build dependencies for Parquet library without arrow support

2017-04-13 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968304#comment-15968304
 ] 

Deepak Majeti commented on ARROW-820:
-

Will post a PR shortly.

> [C++] Build dependencies for Parquet library without arrow support
> --
>
> Key: ARROW-820
> URL: https://issues.apache.org/jira/browse/ARROW-820
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>
> The Parquet C++ library without Arrow support depends only on a subset of Arrow 
> components (buffers, IO). The scope of this JIRA is to build libarrow with 
> minimal dependencies for users of the Parquet C++ library without Arrow support.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-820) [C++] Build dependencies for Parquet library without arrow support

2017-04-13 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-820:
---

 Summary: [C++] Build dependencies for Parquet library without 
arrow support
 Key: ARROW-820
 URL: https://issues.apache.org/jira/browse/ARROW-820
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Deepak Majeti


The Parquet C++ library without Arrow support depends only on a subset of Arrow 
components (buffers, IO). The scope of this JIRA is to build libarrow with 
minimal dependencies for users of the Parquet C++ library without Arrow support.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array

2017-04-13 Thread Itai Incze (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968295#comment-15968295
 ] 

Itai Incze commented on ARROW-809:
--

I've fiddled with it a bit. Without altering the array class, I found there's 
a problem finding the exact number of items with a boolean array (where it 
doesn't matter much) and with a union array. There may be other instances as 
well that I'm not aware of.

Seems to me that adding a private boolean {{IsSliced}} to the array is the 
cleanest way.
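
For illustration only, the kind of check under discussion might look like the sketch below. The function name and the {{value_width}} parameter are assumptions, not Arrow's API; bit-packed boolean buffers are exactly the case where a per-value byte width becomes ambiguous, as noted above.

```python
# Illustrative sketch (not Arrow's actual API): decide whether an array
# covers only part of its data buffer. value_width is bytes per value for
# fixed-width types; bit-packed booleans don't fit this model cleanly,
# which is the difficulty mentioned in the comment.
def is_sliced(offset, length, buffer_nbytes, value_width):
    return offset != 0 or length * value_width < buffer_nbytes

# A 2-row slice of a 100 x int64 buffer (800 bytes) covers only 16 bytes.
print(is_sliced(0, 2, 800, 8))
```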


> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
> Key: ARROW-809
> URL: https://issues.apache.org/jira/browse/ARROW-809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Itai Incze
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.3.0
>
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>   w = pyarrow.ipc.FileWriter(f, rb.schema)
>   w.write_batch(rb2)
>   w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow 
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array

2017-04-13 Thread Itai Incze (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Itai Incze updated ARROW-809:
-
Comment: was deleted

(was: Agreed - it's a small and easy bug. All that's needed is to agree on the 
approach. 

I've fiddled with it a bit. Without altering the array class, I found there's 
a problem finding the exact number of items with a boolean array (where it 
doesn't matter much) and with a union array. There may be other instances as 
well that I'm not aware of.

Seems to me that adding a private boolean {{IsSliced}} to the array is the 
cleanest way. 



)

> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
> Key: ARROW-809
> URL: https://issues.apache.org/jira/browse/ARROW-809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Itai Incze
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.3.0
>
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>   w = pyarrow.ipc.FileWriter(f, rb.schema)
>   w.write_batch(rb2)
>   w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow 
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-809) C++: Writing sliced record batch to IPC writes the entire array

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968285#comment-15968285
 ] 

Wes McKinney commented on ARROW-809:


I'm going to truncate the data buffers to a 64-byte padding offset, patch 
coming tomorrow probably
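
The truncation described above can be sketched as rounding the bytes actually needed up to Arrow's 64-byte buffer padding. A minimal sketch of the arithmetic, not the actual patch:

```python
def padded_nbytes(nbytes, alignment=64):
    # Round up to the next multiple of `alignment`; Arrow pads buffers to
    # 64 bytes, so a truncated buffer still ends on a padding boundary.
    return -(-nbytes // alignment) * alignment

# A 2-value slice of an int64 buffer needs 16 bytes; truncating to the
# padded length writes 64 bytes instead of the full 800-byte buffer.
print(padded_nbytes(2 * 8))
```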

> C++: Writing sliced record batch to IPC writes the entire array
> ---
>
> Key: ARROW-809
> URL: https://issues.apache.org/jira/browse/ARROW-809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Itai Incze
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.3.0
>
>
> The bug can be triggered through python:
> {code}
> import pyarrow.parquet
> array = pyarrow.array.from_pylist([1] * 100)
> rb = pyarrow.RecordBatch.from_arrays([array], ['a'])
> rb2 = rb.slice(0,2)
> with open('/tmp/t.arrow', 'wb') as f:
>   w = pyarrow.ipc.FileWriter(f, rb.schema)
>   w.write_batch(rb2)
>   w.close()
> {code}
> which will result in a big file:
> {code}
> $ ll /tmp/t.arrow 
> -rw-rw-r-- 1 itai itai 800618 Apr 12 13:22 /tmp/t.arrow
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-819) [Python] Define and document public Cython API

2017-04-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-819:
--

 Summary: [Python] Define and document public Cython API
 Key: ARROW-819
 URL: https://issues.apache.org/jira/browse/ARROW-819
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


We have a handful of {{cdef api}} declarations, but it might be useful to have 
a proper {{pyarrow/api.pxd}} file and a prescribed implementation pattern so that 
other Cython users can link to and use the Arrow types in other Python extensions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-818) [Python] Review public pyarrow.* API completeness and update docs

2017-04-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-818:
--

 Summary: [Python] Review public pyarrow.* API completeness and 
update docs
 Key: ARROW-818
 URL: https://issues.apache.org/jira/browse/ARROW-818
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.3.0


There are still many names missing from pyarrow.* and from the ARROW-797 API 
listing. We should do a final review and update before 0.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ARROW-816) [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds

2017-04-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-816.

Resolution: Fixed

Issue resolved by pull request 537
[https://github.com/apache/arrow/pull/537]

> [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds
> --
>
> Key: ARROW-816
> URL: https://issues.apache.org/jira/browse/ARROW-816
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> These libraries are being downloaded and built (in the case of Flatbuffers) 
> multiple times in the course of a normal build. It would be better to build 
> them once and set the *_HOME environment variables so that they can be used 
> throughout the CI run.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ARROW-817) [C++] Fix incorrect code comment from ARROW-722

2017-04-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-817.

Resolution: Fixed

Issue resolved by pull request 536
[https://github.com/apache/arrow/pull/536]

> [C++] Fix incorrect code comment from ARROW-722
> ---
>
> Key: ARROW-817
> URL: https://issues.apache.org/jira/browse/ARROW-817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (ARROW-816) [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds

2017-04-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-816:
--

Assignee: Wes McKinney

> [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds
> --
>
> Key: ARROW-816
> URL: https://issues.apache.org/jira/browse/ARROW-816
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> These libraries are being downloaded and built (in the case of Flatbuffers) 
> multiple times in the course of a normal build. It would be better to build 
> them once and set the *_HOME environment variables so that they can be used 
> throughout the CI run.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967700#comment-15967700
 ] 

Phillip Cloud commented on ARROW-785:
-

I'm not sure if there's still an issue here. When using Drill, if I cast the 
{{WORD}} column to {{varchar}}, then the data looks fine. When left as 
{{binary}}, the values are unintelligible:

{code}
0: jdbc:drill:zk=local> select `YEAR`, cast(`WORD` as varchar) as `WORD` from 
dfs.`/home/phillip/code/cpp/arrow/python/arrow_parquet.parquet`;
+---+-+
| YEAR  |  WORD   |
+---+-+
| 2017  | Word 1  |
| 2018  | Word 2  |
+---+-+
{code}

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jeff Reback
>Priority: Minor
> Fix For: 0.3.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> The OP states that it is not readable in Hive, however.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-817) [C++] Fix incorrect code comment from ARROW-722

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967684#comment-15967684
 ] 

Wes McKinney commented on ARROW-817:


PR: https://github.com/apache/arrow/pull/536

> [C++] Fix incorrect code comment from ARROW-722
> ---
>
> Key: ARROW-817
> URL: https://issues.apache.org/jira/browse/ARROW-817
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-817) [C++] Fix incorrect code comment from ARROW-722

2017-04-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-817:
--

 Summary: [C++] Fix incorrect code comment from ARROW-722
 Key: ARROW-817
 URL: https://issues.apache.org/jira/browse/ARROW-817
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.3.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

2017-04-13 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967675#comment-15967675
 ] 

Phillip Cloud commented on ARROW-785:
-

I'm able to run this using {{beeline}}, declaring the {{word}} column as either 
{{binary}} or {{string}} type in Hive:

{code}
ubuntu@impala:~$ beeline --silent=true --showHeader=false -u 
jdbc:hive2://localhost:1/default -n ubuntu   
0: jdbc:hive2://localhost:1/default> create external table t (year bigint, 
word string) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t;
+-+-+--+
| 2017| Word 1  |
| 2018| Word 2  |
+-+-+--+
0: jdbc:hive2://localhost:1/default> create external table t2 (year bigint, 
word binary) stored as parquet location '/user/hive/warehouse/arrow';
0: jdbc:hive2://localhost:1/default> select * from t2;
+--+--+--+
| 2017 | Word 1   |
| 2018 | Word 2   |
+--+--+--+
{code}

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> 
>
> Key: ARROW-785
> URL: https://issues.apache.org/jira/browse/ARROW-785
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jeff Reback
>Priority: Minor
> Fix For: 0.3.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> The OP states that it is not readable in Hive, however.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-816) [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds

2017-04-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-816:
--

 Summary: [C++] Use conda packages for RapidJSON, Flatbuffers to 
speed up builds
 Key: ARROW-816
 URL: https://issues.apache.org/jira/browse/ARROW-816
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.3.0


These libraries are being downloaded and built (in the case of Flatbuffers) 
multiple times in the course of a normal build. It would be better to build 
them once and set the *_HOME environment variables so that they can be used 
throughout the CI run.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967616#comment-15967616
 ] 

Wes McKinney commented on ARROW-300:


[~kiszk] I agree that having in-memory compression schemes like in Spark is a 
good idea, in addition to simpler snappy/lz4/zlib buffer compression. Would you 
like to make a proposal for improvements to the Arrow metadata to support these 
compression schemes? We should indicate that Arrow implementations are not 
required to implement these in general, so for now they can be marked as 
experimental and optional for implementations (e.g. we wouldn't necessarily 
integration test them). For scan-based in-memory columnar workloads, these 
encodings can yield better scan throughput because of better cache efficiency, 
and many column-oriented databases rely on this to be able to achieve high 
performance, so having it natively in the Arrow libraries seems useful. 
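
As a rough illustration of the simpler per-buffer compression being proposed (zlib here; the metadata dict is a made-up stand-in for whatever the Footer would actually carry):

```python
import zlib

# 100 int64 values of 1: highly compressible, like the repro in ARROW-809.
buf = (1).to_bytes(8, "little") * 100

compressed = zlib.compress(buf)
# A real proposal would record this in the file Footer; the dict below is
# only a stand-in showing what must travel alongside the buffer.
meta = {"codec": "zlib", "uncompressed_length": len(buf)}

assert zlib.decompress(compressed) == buf
print(len(buf), "->", len(compressed))
```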

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-798) [Docs] Publish Format Markdown documents somehow on arrow.apache.org

2017-04-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967604#comment-15967604
 ] 

Wes McKinney commented on ARROW-798:


It seems like we shouldn't need something more complicated than Pelican (since 
a lot of us are Python users) or Jekyll (analogous to Pelican, but Ruby-based) 
for the main site. I figure we can have a central documentation landing point, 
with links into the language-specific documentation: Doxygen for C++, Sphinx 
for Python, etc.

> [Docs] Publish Format Markdown documents somehow on arrow.apache.org
> 
>
> Key: ARROW-798
> URL: https://issues.apache.org/jira/browse/ARROW-798
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-798) [Docs] Publish Format Markdown documents somehow on arrow.apache.org

2017-04-13 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967576#comment-15967576
 ] 

Uwe L. Korn commented on ARROW-798:
---

Currently we have a single static HTML page as the website for Arrow. I would 
like to move it to the same infrastructure that other projects like Calcite use 
(they keep the website in the main source tree and only push the rendered 
version to the website Git/SVN). With that we should be able to easily get 
Markdown documents and API docs rendered and uploaded to arrow.apache.org.

I already had a go at moving to https://github.com/apache/apache-website-template, 
but that stalled as the template is quite heavy.

> [Docs] Publish Format Markdown documents somehow on arrow.apache.org
> 
>
> Key: ARROW-798
> URL: https://issues.apache.org/jira/browse/ARROW-798
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-13 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967452#comment-15967452
 ] 

Uwe L. Korn commented on ARROW-300:
---

Adding methods like RLE or delta encoding brings us very much into the space of 
Parquet. Given that some of these methods are really fast, it might make sense 
to support them for IPC. But then I fear that we will end up in a region where 
there is no clear distinction between Arrow and Parquet anymore.
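
For context, run-length encoding of the sort mentioned can be sketched in a few lines. This is a toy version, not Parquet's actual hybrid RLE/bit-packing format:

```python
# Toy run-length encoder/decoder, only to illustrate the idea of a
# lightweight columnar encoding; Parquet's real format is a hybrid
# RLE/bit-packed scheme.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

print(rle_encode([7, 7, 7, 7, 3, 3]))  # [[7, 4], [3, 2]]
```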

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ARROW-751) [Python] Rename all Cython extensions to "private" status with leading underscore

2017-04-13 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-751.
---
Resolution: Fixed

Issue resolved by pull request 533
[https://github.com/apache/arrow/pull/533]

> [Python] Rename all Cython extensions to "private" status with leading 
> underscore
> -
>
> Key: ARROW-751
> URL: https://issues.apache.org/jira/browse/ARROW-751
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> We can do this after the dust settles with the in-flight patches, but it 
> would be good to have {{pyarrow._array}} instead of {{pyarrow.array}}. If we 
> need to expose a module "publicly" to the user, it would be better to do it 
> in pure Python (as we've done already with {{pyarrow._parquet}}).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ARROW-797) [Python] Add updated pyarrow.* public API listing in Sphinx docs

2017-04-13 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-797.
---
Resolution: Fixed

Issue resolved by pull request 535
[https://github.com/apache/arrow/pull/535]

> [Python] Add updated pyarrow.* public API listing in Sphinx docs
> 
>
> Key: ARROW-797
> URL: https://issues.apache.org/jira/browse/ARROW-797
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> like https://github.com/pandas-dev/pandas/blob/master/doc/source/api.rst



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)