[jira] [Commented] (ARROW-504) [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
[ https://issues.apache.org/jira/browse/ARROW-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834694#comment-15834694 ]

Matthew Rocklin commented on ARROW-504:

At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.

However, generally I agree that streaming chunks in a more granular way is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

From a pure Dask perspective my ideal serialization interface is Python object -> iterator of memoryview objects.

> [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
>
> Key: ARROW-504
> URL: https://issues.apache.org/jira/browse/ARROW-504
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Wes McKinney
>
> While we can convert a {{pandas.DataFrame}} to a single (arbitrarily large) {{arrow::RecordBatch}}, it is not easy to create multiple small record batches -- we could do so in a streaming fashion and immediately write them into an {{arrow::io::OutputStream}}.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (ARROW-504) [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
[ https://issues.apache.org/jira/browse/ARROW-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834694#comment-15834694 ]

Matthew Rocklin edited comment on ARROW-504 at 1/23/17 2:53 PM:

At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.

However generally I agree that streaming chunks in a more granular way is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflowing local buffers. This is the sort of technology that might influence future design decisions.

From a pure Dask perspective my ideal serialization interface is Python object -> iterator of memoryview objects.

was (Author: mrocklin):

At the moment I don't have any active use cases for this. We tend to handle pandas dataframes as atomic blocks of data.

However generally I agree that streaming chunks in a more granular way is probably a better way to go. Non-blocking IO quickly becomes blocking IO if data starts overflows local buffers. This is the sort of technology that might influence future design decisions.

From a pure Dask perspective my ideal serialization interface is Python object -> iterator of memoryview objects.

> [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
>
> Key: ARROW-504
> URL: https://issues.apache.org/jira/browse/ARROW-504
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Wes McKinney
>
> While we can convert a {{pandas.DataFrame}} to a single (arbitrarily large) {{arrow::RecordBatch}}, it is not easy to create multiple small record batches -- we could do so in a streaming fashion and immediately write them into an {{arrow::io::OutputStream}}.
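
The "Python object -> iterator of memoryview objects" interface mentioned in the comment can be illustrated with a minimal standard-library sketch. This is not Dask's or Arrow's actual API: the function name `iter_memoryviews` and the `chunk_size` parameter are assumptions made for illustration, and the input is assumed to be an already-serialized, contiguous bytes-like payload.

```python
from typing import Iterator


def iter_memoryviews(payload, chunk_size: int) -> Iterator[memoryview]:
    """Yield zero-copy slices of `payload`, each at most `chunk_size` bytes.

    Hypothetical helper: `payload` is any contiguous bytes-like object.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Flatten to a view of unsigned bytes so that slicing counts bytes.
    view = memoryview(payload).cast("B")
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview does not copy the underlying data.
        yield view[start:start + chunk_size]


# Usage: a consumer can send each frame as soon as it is produced.
frames = list(iter_memoryviews(b"0123456789", 4))
sizes = [len(frame) for frame in frames]
```

Because each frame is a view rather than a copy, a writer could hand frames to a socket or file one at a time instead of buffering the whole object first.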
[jira] [Resolved] (ARROW-81) [Format] Add a Category logical type (distinct from dictionary-encoding)
[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-81.

Resolution: Fixed

Issue resolved by pull request 297
[https://github.com/apache/arrow/pull/297]

> [Format] Add a Category logical type (distinct from dictionary-encoding)
>
> Key: ARROW-81
> URL: https://issues.apache.org/jira/browse/ARROW-81
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array. Typically there is an "ordered" boolean flag indicating whether the order of the categories is meaningful.
>
> Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion of the Arrow format.
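
The layout described in the issue (integer codes, a child array of categories, and an "ordered" flag) can be sketched in plain Python as follows. The class and function names here are hypothetical and do not reflect the Arrow C++ types added by the pull request.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Category:
    """Illustrative stand-in for a category/factor array."""
    codes: List[int]        # one integer per value, indexing into `categories`
    categories: List[str]   # the "levels" of the array
    ordered: bool = False   # whether the order of the categories is meaningful

    def decode(self) -> List[str]:
        """Expand the codes back into the original values."""
        return [self.categories[code] for code in self.codes]


def encode(values: Sequence[str], ordered: bool = False) -> Category:
    """Build a Category, assigning codes in order of first appearance."""
    categories: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return Category(codes, categories, ordered)


cat = encode(["low", "high", "low", "mid"])
```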
[jira] [Resolved] (ARROW-470) [Python] Add "FileSystem" abstraction to access directories of files in a uniform way
[ https://issues.apache.org/jira/browse/ARROW-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-470.

Resolution: Fixed

Resolved in https://github.com/apache/arrow/commit/53a478dfb278dcae5ca7f300b70857662553d118

> [Python] Add "FileSystem" abstraction to access directories of files in a uniform way
>
> Key: ARROW-470
> URL: https://issues.apache.org/jira/browse/ARROW-470
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> This will give local file system, HDFS, and eventually S3 the same basic API
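
The idea of giving several storage backends "the same basic API" might look roughly like the sketch below. The method names (`ls`, `exists`, `open`) and the helper `total_bytes` are assumptions chosen for illustration, not the interface that was actually committed; only a local backend is shown, using the standard library.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import BinaryIO, List


class FileSystem(ABC):
    """Hypothetical uniform interface; an HDFS or S3 backend would subclass it."""

    @abstractmethod
    def ls(self, path: str) -> List[str]: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...

    @abstractmethod
    def open(self, path: str, mode: str = "rb") -> BinaryIO: ...


class LocalFileSystem(FileSystem):
    def ls(self, path: str) -> List[str]:
        return sorted(os.path.join(path, name) for name in os.listdir(path))

    def exists(self, path: str) -> bool:
        return os.path.exists(path)

    def open(self, path: str, mode: str = "rb") -> BinaryIO:
        # Inside a method body, `open` still refers to the builtin.
        return open(path, mode)


def total_bytes(fs: FileSystem, directory: str) -> int:
    """Backend-agnostic code: works with any FileSystem implementation."""
    total = 0
    for path in fs.ls(directory):
        with fs.open(path) as handle:
            total += len(handle.read())
    return total


with tempfile.TemporaryDirectory() as root:
    for name, payload in [("a.bin", b"abc"), ("b.bin", b"de")]:
        with open(os.path.join(root, name), "wb") as out:
            out.write(payload)
    fs = LocalFileSystem()
    names = [os.path.basename(p) for p in fs.ls(root)]
    size = total_bytes(fs, root)
```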
[jira] [Assigned] (ARROW-508) [C++] Make file/memory-mapped file interfaces threadsafe
[ https://issues.apache.org/jira/browse/ARROW-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-508:

Assignee: Wes McKinney

> [C++] Make file/memory-mapped file interfaces threadsafe
>
> Key: ARROW-508
> URL: https://issues.apache.org/jira/browse/ARROW-508
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> There are some functions which could be impacted by race conditions. In light of PARQUET-835 we will want to hold a lock in the appropriate places
[jira] [Resolved] (ARROW-508) [C++] Make file/memory-mapped file interfaces threadsafe
[ https://issues.apache.org/jira/browse/ARROW-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-508.

Resolution: Fixed

Issue resolved by pull request 300
[https://github.com/apache/arrow/pull/300]

> [C++] Make file/memory-mapped file interfaces threadsafe
>
> Key: ARROW-508
> URL: https://issues.apache.org/jira/browse/ARROW-508
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> There are some functions which could be impacted by race conditions. In light of PARQUET-835 we will want to hold a lock in the appropriate places
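
One kind of race the issue may be referring to is a seek followed by a read on a shared handle: if two threads interleave those calls, one of them reads from the wrong position. The sketch below, in Python rather than the C++ of the actual fix, holds a lock around the pair. The class name `LockedFile` and method `read_at` are hypothetical.

```python
import io
import threading


class LockedFile:
    """Wrap a seekable file so that seek + read happen as one atomic step."""

    def __init__(self, raw):
        self._raw = raw
        self._lock = threading.Lock()

    def read_at(self, offset: int, nbytes: int) -> bytes:
        # Without the lock, another thread could seek between these two calls.
        with self._lock:
            self._raw.seek(offset)
            return self._raw.read(nbytes)


data = bytes(range(256)) * 16  # 4096 bytes of known content
shared = LockedFile(io.BytesIO(data))
results = {}


def worker(offset: int) -> None:
    results[offset] = shared.read_at(offset, 8)


threads = [threading.Thread(target=worker, args=(o,)) for o in range(0, 4096, 64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```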
[jira] [Resolved] (ARROW-503) [Python] Interface to streaming binary format
[ https://issues.apache.org/jira/browse/ARROW-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-503.

Resolution: Fixed

Issue resolved by pull request 299
[https://github.com/apache/arrow/pull/299]

> [Python] Interface to streaming binary format
>
> Key: ARROW-503
> URL: https://issues.apache.org/jira/browse/ARROW-503
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
[jira] [Resolved] (ARROW-506) Implement Arrow Echo server for integration testing
[ https://issues.apache.org/jira/browse/ARROW-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-506.

Resolution: Fixed

Issue resolved by pull request 295
[https://github.com/apache/arrow/pull/295]

> Implement Arrow Echo server for integration testing
>
> Key: ARROW-506
> URL: https://issues.apache.org/jira/browse/ARROW-506
> Project: Apache Arrow
> Issue Type: Task
> Components: Java - Vectors
> Reporter: Nong Li
> Assignee: Nong Li
>
> It would be convenient to have a test utility that would run an Arrow echo server that receives Arrow streams over a socket and then echoes them back. This would exercise the serialize and deserialize code on the server and the client.
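
The shape of such a test utility can be sketched with Python's standard `socketserver` module. This is only a byte-level echo under stated assumptions: it does not parse Arrow streams (the server in the issue is a Java component that deserializes and reserializes them), and `EchoHandler` and `echo_roundtrip` are hypothetical names.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # Echo every chunk back until the client closes its write side.
        while True:
            chunk = self.request.recv(4096)
            if not chunk:
                break
            self.request.sendall(chunk)


def echo_roundtrip(payload: bytes) -> bytes:
    """Start a loopback echo server, send `payload`, return what comes back."""
    # Port 0 asks the OS for any free port.
    with socketserver.TCPServer(("127.0.0.1", 0), EchoHandler) as server:
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        received = bytearray()
        try:
            with socket.create_connection(server.server_address) as client:
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # signal end of stream
                while True:
                    chunk = client.recv(4096)
                    if not chunk:
                        break
                    received.extend(chunk)
        finally:
            server.shutdown()
        return bytes(received)


echoed = echo_roundtrip(b"arrow stream bytes")
```

An integration test would compare the echoed data with what was sent; for real Arrow streams the comparison would be made on the decoded record batches rather than on raw bytes.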
[jira] [Resolved] (ARROW-494) [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer references still exist
[ https://issues.apache.org/jira/browse/ARROW-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-494.

Resolution: Fixed

Issue resolved by pull request 298
[https://github.com/apache/arrow/pull/298]

> [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer references still exist
>
> Key: ARROW-494
> URL: https://issues.apache.org/jira/browse/ARROW-494
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> I'd like to see if there is some way to "protect" the memory map from premature destruction. This is a slight artifact of MemoryMappedFile's implementation sharing code with normal on-disk files (which read into allocated memory), i.e. the `Close` function unmaps the memory and closes the file handle.
>
> This would amount to creating a Buffer subclass that retains ownership of the file descriptor and memory map, so that if any Buffer still references the memory map, then `MemoryMappedFile::Close` will not unmap the memory or close the file. But then the unmapping / file close would need to happen when the last Buffer reference is destroyed.
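
The ownership scheme proposed in the description can be modelled with explicit reference counting. The sketch below is a Python illustration only, with hypothetical class names and a `bytearray` standing in for the mapped region; the C++ fix would more likely rely on shared ownership of the mapping than on manual counting.

```python
class Mapping:
    """Stand-in for the memory map plus file descriptor."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.unmapped = False
        self._refs = 1  # reference held by the file object itself

    def acquire(self) -> None:
        self._refs += 1

    def release(self) -> None:
        self._refs -= 1
        if self._refs == 0:
            # Real code would unmap the memory and close the file here.
            self.unmapped = True


class MappedBuffer:
    """Buffer that keeps the mapping alive for as long as it exists."""

    def __init__(self, mapping: Mapping, offset: int, length: int):
        mapping.acquire()
        self._mapping = mapping
        self._view = memoryview(mapping.data)[offset:offset + length]
        self._released = False

    def tobytes(self) -> bytes:
        return self._view.tobytes()

    def release(self) -> None:
        if not self._released:
            self._released = True
            self._view.release()
            self._mapping.release()


class MappedFile:
    def __init__(self, data: bytes):
        self.mapping = Mapping(data)
        self._closed = False

    def read_buffer(self, offset: int, length: int) -> MappedBuffer:
        if self._closed:
            raise ValueError("file is closed")
        return MappedBuffer(self.mapping, offset, length)

    def close(self) -> None:
        # Drops only the file's own reference; buffers may outlive it.
        if not self._closed:
            self._closed = True
            self.mapping.release()


f = MappedFile(b"hello arrow")
buf = f.read_buffer(6, 5)
f.close()
alive_after_close = not f.mapping.unmapped
content = buf.tobytes()
buf.release()
unmapped_after_release = f.mapping.unmapped
```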
[jira] [Resolved] (ARROW-475) [Python] High level support for reading directories of Parquet files (as a single Arrow table) from supported file system interfaces
[ https://issues.apache.org/jira/browse/ARROW-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-475.

Resolution: Fixed

Issue resolved by pull request 296
[https://github.com/apache/arrow/pull/296]

> [Python] High level support for reading directories of Parquet files (as a single Arrow table) from supported file system interfaces
>
> Key: ARROW-475
> URL: https://issues.apache.org/jira/browse/ARROW-475
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
>
> This is the end result of a bunch of associated work both in parquet-cpp and Arrow