[jira] [Commented] (ARROW-376) Python: Convert non-range Pandas indices (optionally) to Arrow

2017-03-02 Thread Jim Ahn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893699#comment-15893699
 ] 

Jim Ahn commented on ARROW-376:
---

Oops.  Wes, I did not see your comment until I posted mine.  No worries. I'll 
move on to another task.  Thanks.

> Python: Convert non-range Pandas indices (optionally) to Arrow
> --
>
> Key: ARROW-376
> URL: https://issues.apache.org/jira/browse/ARROW-376
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Wes McKinney
> Priority: Minor
> Labels: newbie
> Fix For: 0.3.0
>
>
> Currently the index of a Pandas DataFrame is ignored entirely in the Pandas 
> to Arrow conversion. We should add an option to also convert the index to an 
> Arrow column if it is not a simple range index.
> The condition for a simple index should be {{isinstance(df.index, 
> pd.RangeIndex) and (df.index._start == 0) and (df.index._stop == len(df.index)) 
> and (df.index._step == 1)}}. In this case, we can always skip the index 
> conversion. Otherwise, a new column in the Arrow table shall be created using 
> the index's name as the name of the column. Additionally, there should be a 
> metadata annotation on that column indicating that it is derived from a Pandas 
> Index, so that on round trips we use it again as the index of the DataFrame.
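
A minimal sketch of the simple-index check described in the issue, written against the public {{start}}, {{stop}} and {{step}} attributes instead of the private {{_start}}/{{_stop}}/{{_step}} ones. It assumes a pandas version that exposes those public attributes on {{RangeIndex}}. The function name {{is_simple_range_index}} is hypothetical, and the built-in {{range}} is accepted as a stand-in so the sketch also runs without pandas installed.

```python
try:
    import pandas as pd
    RANGE_TYPES = (range, pd.RangeIndex)
except ImportError:
    # pandas is optional for this sketch; the built-in range exposes
    # the same start/stop/step attributes.
    RANGE_TYPES = (range,)


def is_simple_range_index(index):
    """Return True if `index` is the default 0, 1, ..., n-1 range.

    Hypothetical helper: for such an index the conversion can be skipped,
    because the index can be rebuilt from the row count alone.
    """
    if not isinstance(index, RANGE_TYPES):
        return False
    return index.start == 0 and index.stop == len(index) and index.step == 1
```

With pandas available this would be called as {{is_simple_range_index(df.index)}}. An index that starts at 1, uses a step other than 1, or is not a range at all returns False and would need to be written as a column.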



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-593) [C++] Rename ReadableFileInterface to RandomAccessFile

2017-03-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-593:
--

Summary: [C++] Rename ReadableFileInterface to RandomAccessFile
Key: ARROW-593
URL: https://issues.apache.org/jira/browse/ARROW-593
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.3.0


I think this makes it clearer what instances of the base class are.





[jira] [Updated] (ARROW-376) Python: Convert non-range Pandas indices (optionally) to Arrow

2017-03-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-376:
---
Fix Version/s: 0.3.0

> Python: Convert non-range Pandas indices (optionally) to Arrow
> --
>
> Key: ARROW-376
> URL: https://issues.apache.org/jira/browse/ARROW-376
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Jim Ahn
> Priority: Minor
> Labels: newbie
> Fix For: 0.3.0
>
>





[jira] [Commented] (ARROW-376) Python: Convert non-range Pandas indices (optionally) to Arrow

2017-03-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893223#comment-15893223
 ] 

Wes McKinney commented on ARROW-376:


[~ahnj] if you don't mind, I will take care of this one. It requires a bit of 
work to expose the {{custom_metadata}} fields in the file metadata.
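
To illustrate why key/value metadata matters for this issue, here is a toy model in plain Python of the round trip: the index is stored as an ordinary column, and a metadata entry records which column it was. This is only a sketch under stated assumptions, not the Arrow API. The key name {{pandas_index_columns}} and both function names are hypothetical, and the table is modeled as a dict of lists with string key/value metadata.

```python
import json

# Hypothetical metadata key; the thread does not specify a name.
INDEX_KEY = "pandas_index_columns"


def to_columns(index_name, index_values, columns):
    """Store the index as a regular column plus a metadata annotation."""
    table = dict(columns)
    table[index_name] = list(index_values)
    # Custom metadata is modeled as string keys mapped to string values.
    metadata = {INDEX_KEY: json.dumps([index_name])}
    return table, metadata


def from_columns(table, metadata):
    """Use the annotation to split the index column back out."""
    index_names = json.loads(metadata.get(INDEX_KEY, "[]"))
    index = {name: table[name] for name in index_names}
    data = {name: values for name, values in table.items()
            if name not in index_names}
    return index, data
```

Without the annotation, a reader has no way to tell the former index column from any other column, so the original DataFrame index could not be restored.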

> Python: Convert non-range Pandas indices (optionally) to Arrow
> --
>
> Key: ARROW-376
> URL: https://issues.apache.org/jira/browse/ARROW-376
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Wes McKinney
> Priority: Minor
> Labels: newbie
> Fix For: 0.3.0
>
>





[jira] [Assigned] (ARROW-376) Python: Convert non-range Pandas indices (optionally) to Arrow

2017-03-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-376:
--

Assignee: Wes McKinney  (was: Jim Ahn)

> Python: Convert non-range Pandas indices (optionally) to Arrow
> --
>
> Key: ARROW-376
> URL: https://issues.apache.org/jira/browse/ARROW-376
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Wes McKinney
> Priority: Minor
> Labels: newbie
> Fix For: 0.3.0
>
>





[jira] [Assigned] (ARROW-452) [C++/Python] Merge "Feather" file format implementation

2017-03-02 Thread Matthew Rocklin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Rocklin reassigned ARROW-452:
-

Assignee: (was: Matthew Rocklin)

> [C++/Python] Merge "Feather" file format implementation
> ---
>
> Key: ARROW-452
> URL: https://issues.apache.org/jira/browse/ARROW-452
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
>
> See https://github.com/wesm/feather/tree/master/cpp -- this will assist with 
> code consolidation and reconciling metadata requirements for Python and R 
> users, with the goal of eventually using the Arrow IPC format for everything 
> and deprecating the less-flexible Feather format / metadata. 





[jira] [Commented] (ARROW-376) Python: Convert non-range Pandas indices (optionally) to Arrow

2017-03-02 Thread Matthew Rocklin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893208#comment-15893208
 ] 

Matthew Rocklin commented on ARROW-376:
---

I would love to see this issue get higher priority.  If it were implemented, I 
would like to experiment with using Arrow as Dask's network serialization 
format for pandas dataframes.  I think we would see good speed boosts on 
communication-heavy workloads like shuffles.  This would be fun to write about 
afterwards.

> Python: Convert non-range Pandas indices (optionally) to Arrow
> --
>
> Key: ARROW-376
> URL: https://issues.apache.org/jira/browse/ARROW-376
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Jim Ahn
> Priority: Minor
> Labels: newbie
>





[jira] [Assigned] (ARROW-452) [C++/Python] Merge "Feather" file format implementation

2017-03-02 Thread Matthew Rocklin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Rocklin reassigned ARROW-452:
-

Assignee: Matthew Rocklin  (was: Wes McKinney)

> [C++/Python] Merge "Feather" file format implementation
> ---
>
> Key: ARROW-452
> URL: https://issues.apache.org/jira/browse/ARROW-452
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Matthew Rocklin
>





[jira] [Resolved] (ARROW-576) [C++] Complete round trip Union file/stream IPC tests

2017-03-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-576.

Resolution: Fixed

Issue resolved by pull request 356
[https://github.com/apache/arrow/pull/356]

> [C++] Complete round trip Union file/stream IPC tests
> -
>
> Key: ARROW-576
> URL: https://issues.apache.org/jira/browse/ARROW-576
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> I encountered this while working on ARROW-459: although we can write and 
> read record batches, the schema/metadata parts of the file and stream 
> reader/writer implementations are not complete. I did a bit of work on it in 
> my patch for ARROW-459 but gave up when I realized there was more work to do 
> than made sense for that patch (which is for dictionary-encoded data). 





[jira] [Commented] (ARROW-592) [C++] Provide .deb and .rpm packages

2017-03-02 Thread Kouhei Sutou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892371#comment-15892371
 ] 

Kouhei Sutou commented on ARROW-592:


> do you have the expertise to write the package recipes?

Yes.

We need to decide on our package repository location before we create the 
package recipes. For example, if we choose packagecloud, we should not use 
CMake's CPack; we should use a debian/ directory and a .spec file instead.

> [C++] Provide .deb and .rpm packages
> 
>
> Key: ARROW-592
> URL: https://issues.apache.org/jira/browse/ARROW-592
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++
> Environment: GNU/Linux
> Reporter: Kouhei Sutou
> Priority: Minor
>
> If we provide .deb and .rpm packages for C++ Arrow, users can install it 
> easily. (At least, I would be happy as a user.)
> Is there any location where we can provide .deb and .rpm packages? If not, 
> how about using https://packagecloud.io/ with the "Open Source plan"? We can 
> find the "Open Source plan" by clicking "Looking for free or open-source 
> plans" at https://packagecloud.io/pricing .





[jira] [Commented] (ARROW-592) [C++] Provide .deb and .rpm packages

2017-03-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892315#comment-15892315
 ] 

Wes McKinney commented on ARROW-592:


I agree. For some Python and R users, installing the Arrow libraries via apt or 
yum will be much better than bundling the Arrow sources with a third-party 
library.

[~kou] do you have the expertise to write the package recipes? I can try to help.

> [C++] Provide .deb and .rpm packages
> 
>
> Key: ARROW-592
> URL: https://issues.apache.org/jira/browse/ARROW-592
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++
> Environment: GNU/Linux
> Reporter: Kouhei Sutou
> Priority: Minor
>





[jira] [Created] (ARROW-592) [C++] Provide .deb and .rpm packages

2017-03-02 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-592:
--

Summary: [C++] Provide .deb and .rpm packages
Key: ARROW-592
URL: https://issues.apache.org/jira/browse/ARROW-592
Project: Apache Arrow
Issue Type: Wish
Components: C++
Environment: GNU/Linux
Reporter: Kouhei Sutou
Priority: Minor


If we provide .deb and .rpm packages for C++ Arrow, users can install it 
easily. (At least, I would be happy as a user.)

Is there any location where we can provide .deb and .rpm packages? If not, 
how about using https://packagecloud.io/ with the "Open Source plan"? We can 
find the "Open Source plan" by clicking "Looking for free or open-source 
plans" at https://packagecloud.io/pricing .



