[jira] [Commented] (ARROW-504) [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format

2017-01-23 Thread Matthew Rocklin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834694#comment-15834694
 ] 

Matthew Rocklin commented on ARROW-504:
---

At the moment I don't have any active use cases for this.  We tend to handle 
pandas dataframes as atomic blocks of data.

However generally I agree that streaming chunks in a more granular way is 
probably a better way to go.  Non-blocking IO quickly becomes blocking IO if 
data starts overflowing local buffers.  This is the sort of technology that might 
influence future design decisions.

From a pure Dask perspective my ideal serialization interface is Python object 
-> iterator of memoryview objects.  
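A minimal sketch of the "Python object -> iterator of memoryview objects" shape mentioned above. It uses pickle only as a stand-in serializer; the function names and the chunk_size parameter are hypothetical and are not part of Dask or Arrow.

```python
import pickle
from typing import Any, Iterable, Iterator


def serialize_chunks(obj: Any, chunk_size: int = 64 * 1024) -> Iterator[memoryview]:
    """Yield the pickled form of obj as memoryview slices of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    payload = memoryview(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    for start in range(0, len(payload), chunk_size):
        # Slicing a memoryview references the same bytes; nothing is copied here.
        yield payload[start:start + chunk_size]


def deserialize_chunks(chunks: Iterable[memoryview]) -> Any:
    """Reassemble the chunks and unpickle them (this side does copy)."""
    return pickle.loads(b"".join(chunks))
```

A consumer can hand each chunk to a socket or queue as it arrives instead of waiting for one large buffer.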

> [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to 
> streaming format
> --
>
> Key: ARROW-504
> URL: https://issues.apache.org/jira/browse/ARROW-504
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Wes McKinney
>
> While we can convert a {{pandas.DataFrame}} to a single (arbitrarily large) 
> {{arrow::RecordBatch}}, it is not easy to create multiple small record 
> batches -- we could do so in a streaming fashion and immediately write them 
> into an {{arrow::io::OutputStream}}.
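The chunk-and-write idea in the description could be sketched roughly as follows. This is an illustration using only the standard library: a dict of column lists stands in for pandas.DataFrame, a length-prefixed JSON frame stands in for a record batch, and the framing is NOT the Arrow streaming format. All names are hypothetical.

```python
import io
import json
import struct
from typing import BinaryIO, Dict, Iterator, List

Columns = Dict[str, List]


def iter_row_chunks(columns: Columns, chunk_size: int) -> Iterator[Columns]:
    """Yield consecutive row ranges of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, chunk_size):
        yield {name: values[start:start + chunk_size]
               for name, values in columns.items()}


def write_chunks(columns: Columns, sink: BinaryIO, chunk_size: int) -> int:
    """Write each chunk to sink as soon as it is produced; return the chunk count."""
    count = 0
    for chunk in iter_row_chunks(columns, chunk_size):
        body = json.dumps(chunk).encode("utf-8")
        sink.write(struct.pack("<I", len(body)))  # 4-byte little-endian length prefix
        sink.write(body)
        count += 1
    return count


def read_chunks(source: BinaryIO) -> Iterator[Columns]:
    """Read back the frames written by write_chunks."""
    while True:
        header = source.read(4)
        if not header:
            return
        (length,) = struct.unpack("<I", header)
        yield json.loads(source.read(length).decode("utf-8"))
```

Because each chunk is written immediately, the writer holds at most one small chunk of encoded data at a time.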



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (ARROW-504) [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format

2017-01-23 Thread Matthew Rocklin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834694#comment-15834694
 ] 

Matthew Rocklin edited comment on ARROW-504 at 1/23/17 2:53 PM:


At the moment I don't have any active use cases for this.  We tend to handle 
pandas dataframes as atomic blocks of data.

However generally I agree that streaming chunks in a more granular way is 
probably a better way to go.  Non-blocking IO quickly becomes blocking IO if 
data starts overflowing local buffers.  This is the sort of technology that 
might influence future design decisions.

>From a pure Dask perspective my ideal serialization interface is Python object 
>-> iterator of memoryview objects.  


was (Author: mrocklin):
At the moment I don't have any active use cases for this.  We tend to handle 
pandas dataframes as atomic blocks of data.

However generally I agree that streaming chunks in a more granular way is 
probably a better way to go.  Non-blocking IO quickly becomes blocking IO if 
data starts overflows local buffers.  This is the sort of technology that might 
influence future design decisions.

>From a pure Dask perspective my ideal serialization interface is Python object 
>-> iterator of memoryview objects.  

> [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to 
> streaming format
> --
>
> Key: ARROW-504
> URL: https://issues.apache.org/jira/browse/ARROW-504
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Wes McKinney
>
> While we can convert a {{pandas.DataFrame}} to a single (arbitrarily large) 
> {{arrow::RecordBatch}}, it is not easy to create multiple small record 
> batches -- we could do so in a streaming fashion and immediately write them 
> into an {{arrow::io::OutputStream}}.





[jira] [Resolved] (ARROW-81) [Format] Add a Category logical type (distinct from dictionary-encoding)

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-81.
---
Resolution: Fixed

Issue resolved by pull request 297
[https://github.com/apache/arrow/pull/297]

> [Format] Add a Category logical type (distinct from dictionary-encoding)
> 
>
> Key: ARROW-81
> URL: https://issues.apache.org/jira/browse/ARROW-81
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array. Typically there is an "ordered" boolean flag indicating whether 
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 
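As a rough illustration of the layout described above (integer codes, a child array of categories, and an "ordered" flag), here is a small sketch in plain Python. The class and function names are hypothetical and do not reflect the Arrow C++ implementation; nulls are not handled.

```python
from dataclasses import dataclass
from typing import Hashable, List, Sequence


@dataclass
class Category:
    codes: List[int]             # one integer code per value
    categories: List[Hashable]   # the distinct "levels", indexed by code
    ordered: bool = False        # whether the order of categories is meaningful

    def decode(self) -> List[Hashable]:
        """Expand the codes back into the original values."""
        return [self.categories[code] for code in self.codes]


def encode_category(values: Sequence[Hashable], ordered: bool = False) -> Category:
    """Dictionary-encode values, assigning codes in order of first appearance."""
    categories: List[Hashable] = []
    index = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return Category(codes, categories, ordered)
```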





[jira] [Resolved] (ARROW-470) [Python] Add "FileSystem" abstraction to access directories of files in a uniform way

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-470.

Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/53a478dfb278dcae5ca7f300b70857662553d118 

> [Python] Add "FileSystem" abstraction to access directories of files in a 
> uniform way
> -
>
> Key: ARROW-470
> URL: https://issues.apache.org/jira/browse/ARROW-470
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> This will give local file system, HDFS, and eventually S3 the same basic API
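One way to picture "the same basic API" across back ends is an abstract base class with a local implementation. The sketch below is an assumption made for illustration; the class and method names are hypothetical and are not taken from the pyarrow code resolved in this issue.

```python
import os
from abc import ABC, abstractmethod
from typing import BinaryIO, List


class FileSystem(ABC):
    """Minimal interface that each back end (local, HDFS, ...) would implement."""

    @abstractmethod
    def ls(self, path: str) -> List[str]:
        """Return the full paths of the entries in a directory."""

    @abstractmethod
    def open(self, path: str, mode: str = "rb") -> BinaryIO:
        """Open a file and return a file-like object."""


class LocalFileSystem(FileSystem):
    def ls(self, path: str) -> List[str]:
        return sorted(os.path.join(path, name) for name in os.listdir(path))

    def open(self, path: str, mode: str = "rb") -> BinaryIO:
        return open(path, mode)  # the built-in open, not this method


def read_directory(fs: FileSystem, directory: str) -> bytes:
    """Concatenate every file in a directory, using only the FileSystem interface."""
    parts = []
    for path in fs.ls(directory):
        with fs.open(path) as handle:
            parts.append(handle.read())
    return b"".join(parts)
```

Code such as read_directory depends only on the interface, so it would not change if a different back end were passed in.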





[jira] [Assigned] (ARROW-508) [C++] Make file/memory-mapped file interfaces threadsafe

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-508:
--

Assignee: Wes McKinney

> [C++] Make file/memory-mapped file interfaces threadsafe
> 
>
> Key: ARROW-508
> URL: https://issues.apache.org/jira/browse/ARROW-508
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> There are some functions which could be impacted by race conditions. In 
> light of PARQUET-835 we will want to hold a lock in the appropriate places
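The kind of race in question can be illustrated with a positional read built from seek followed by read: two threads sharing one file position can interleave and read the wrong bytes. A minimal Python sketch of holding a lock around that pair follows; the class name and read_at method are hypothetical and this is not the Arrow C++ code.

```python
import io
import threading
from typing import BinaryIO


class LockedReader:
    """Wraps a seekable file so that seek + read happen as one atomic step."""

    def __init__(self, raw: BinaryIO) -> None:
        self._raw = raw
        self._lock = threading.Lock()

    def read_at(self, position: int, nbytes: int) -> bytes:
        # Without the lock, another thread could seek between these two calls.
        with self._lock:
            self._raw.seek(position)
            return self._raw.read(nbytes)
```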





[jira] [Resolved] (ARROW-508) [C++] Make file/memory-mapped file interfaces threadsafe

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-508.

Resolution: Fixed

Issue resolved by pull request 300
[https://github.com/apache/arrow/pull/300]

> [C++] Make file/memory-mapped file interfaces threadsafe
> 
>
> Key: ARROW-508
> URL: https://issues.apache.org/jira/browse/ARROW-508
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> There are some functions which could be impacted by race conditions. In 
> light of PARQUET-835 we will want to hold a lock in the appropriate places





[jira] [Resolved] (ARROW-503) [Python] Interface to streaming binary format

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-503.

Resolution: Fixed

Issue resolved by pull request 299
[https://github.com/apache/arrow/pull/299]

> [Python] Interface to streaming binary format
> -
>
> Key: ARROW-503
> URL: https://issues.apache.org/jira/browse/ARROW-503
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>






[jira] [Resolved] (ARROW-506) Implement Arrow Echo server for integration testing

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-506.

Resolution: Fixed

Issue resolved by pull request 295
[https://github.com/apache/arrow/pull/295]

> Implement Arrow Echo server for integration testing
> ---
>
> Key: ARROW-506
> URL: https://issues.apache.org/jira/browse/ARROW-506
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java - Vectors
>Reporter: Nong Li
>Assignee: Nong Li
>
> It would be convenient to have a test utility that would run an Arrow echo 
> server that receives Arrow streams over a socket and then echoes them back. 
> This would exercise the serialization and deserialization code on both the 
> server and the client.
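The general shape of such an echo utility can be sketched with the Python standard library. Note the assumptions: the issue targets the Java component, and this sketch echoes raw bytes only, so it does not itself decode or re-encode Arrow streams. The helper names are hypothetical.

```python
import socket
import socketserver
import threading
from typing import Tuple


class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # Send every received block straight back until the client stops writing.
        while True:
            data = self.request.recv(4096)
            if not data:
                break
            self.request.sendall(data)


def start_echo_server(host: str = "127.0.0.1", port: int = 0) -> socketserver.ThreadingTCPServer:
    """Start the server on a background thread; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), EchoHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def echo_roundtrip(address: Tuple[str, int], payload: bytes) -> bytes:
    """Send payload and collect the echo. Intended for small test payloads only:
    a very large payload would need the client to read while it writes."""
    with socket.create_connection(address) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # signal end of input to the server
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

An integration test would compare the bytes (or decoded batches) returned by echo_roundtrip against what was sent.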





[jira] [Resolved] (ARROW-494) [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer references still exist

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-494.

Resolution: Fixed

Issue resolved by pull request 298
[https://github.com/apache/arrow/pull/298]

> [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer 
> references still exist
> ---
>
> Key: ARROW-494
> URL: https://issues.apache.org/jira/browse/ARROW-494
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> I'd like to see if there is some way to "protect" the memory map from 
> premature destruction. This is a slight artifact of MemoryMappedFile's 
> implementation sharing code with normal on-disk files (which read into 
> allocated memory), i.e. the `Close` function unmaps the memory and closes the 
> file handle. This would amount to creating a Buffer subclass that retains 
> ownership of the file descriptor and memory map, so that if any Buffer still 
> references the memory map, then `MemoryMappedFile::Close` will not unmap the 
> memory or close the file. But then the unmapping / file close would need to 
> happen when the last Buffer reference is destroyed. 
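The ownership idea described above (Close does not unmap while buffers are alive; the last buffer release does the unmapping) can be sketched in Python with an explicit reference count. This is an illustration only, not the C++ design that was merged: the SharedMap and MappedBuffer names are hypothetical, and closing the file handle itself is left out.

```python
import mmap
import threading


class MappedBuffer:
    """A view into the map that keeps the mapping alive until released."""

    def __init__(self, owner: "SharedMap", view: memoryview) -> None:
        self._owner = owner
        self.view = view
        self._released = False

    def release(self) -> None:
        if not self._released:
            self._released = True
            self.view.release()
            self._owner._release_one()


class SharedMap:
    """Owns the memory map; unmaps only after close() AND all buffers are released."""

    def __init__(self, fileno: int, length: int = 0) -> None:
        self._map = mmap.mmap(fileno, length, access=mmap.ACCESS_READ)
        self._lock = threading.Lock()
        self._live_buffers = 0
        self._close_requested = False
        self.unmapped = False

    def get_buffer(self, offset: int, nbytes: int) -> MappedBuffer:
        with self._lock:
            if self._close_requested:
                raise ValueError("map has been closed")
            self._live_buffers += 1
        full = memoryview(self._map)
        view = full[offset:offset + nbytes]  # zero-copy slice of the mapping
        full.release()                       # only the slice keeps the export alive
        return MappedBuffer(self, view)

    def close(self) -> None:
        with self._lock:
            self._close_requested = True
            self._maybe_unmap()

    def _release_one(self) -> None:
        with self._lock:
            self._live_buffers -= 1
            self._maybe_unmap()

    def _maybe_unmap(self) -> None:
        # Caller holds the lock. Unmap once nobody can reference the memory.
        if self._close_requested and self._live_buffers == 0 and not self.unmapped:
            self._map.close()
            self.unmapped = True
```

With this arrangement, calling close() while a buffer is outstanding only records the request; the memory is unmapped when that buffer is released.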





[jira] [Resolved] (ARROW-475) [Python] High level support for reading directories of Parquet files (as a single Arrow table) from supported file system interfaces

2017-01-23 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-475.

Resolution: Fixed

Issue resolved by pull request 296
[https://github.com/apache/arrow/pull/296]

> [Python] High level support for reading directories of Parquet files (as a 
> single Arrow table) from supported file system interfaces
> 
>
> Key: ARROW-475
> URL: https://issues.apache.org/jira/browse/ARROW-475
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> This is the end result of a bunch of associated work both in parquet-cpp and 
> Arrow


