Re: Intro to pandas + pyarrow integration?
In case it's interesting, I gave a talk a little over 3 years ago on this theme ("we all have data frames, but they're all different inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly. In it I mentioned the desire for an "Apache-licensed, community standard C/C++ data frame that we can all use".

On Fri, Jul 6, 2018 at 1:53 PM, Alex Buchanan wrote:
> Ok, interesting. Thanks Wes, that does make it clear.
Re: Intro to pandas + pyarrow integration?
Ok, interesting. Thanks Wes, that does make it clear.

For other readers, this GitHub issue is related:
https://github.com/apache/arrow/issues/2189#issuecomment-402874836

On 7/6/18, 10:25 AM, "Wes McKinney" wrote:
> hi Alex,
Re: Intro to pandas + pyarrow integration?
hi Alex,

One of the goals of Apache Arrow is to define an open standard for in-memory columnar data (which may be called "tables" or "data frames" in some domains). Among other things, the Arrow columnar format is optimized for memory efficiency and analytical processing performance on very large (even larger-than-RAM) data sets.

The way to think about it is that pandas has its own in-memory representation for columnar data, but it is "proprietary" to pandas. To make use of pandas's analytical facilities, you must convert data to pandas's memory representation. As an example, pandas represents strings as NumPy arrays of Python string objects, which is very wasteful. Uwe Korn recently demonstrated an approach to using Arrow inside pandas, but it would require a lot of work to port algorithms to run against Arrow: https://github.com/xhochy/fletcher

We are working to develop the standard data frame operations as reusable libraries within this project, and these will run natively against the Arrow columnar format. This is a big project; we would love to have you involved with the effort. One of the reasons I have spent so much of my time on this project over the last few years is that I believe it is the best path to building a faster, more efficient pandas-like library for data scientists.

best,
Wes

On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan wrote:
> Hello all.
Intro to pandas + pyarrow integration?
Hello all.

I'm confused about the current level of integration between pandas and pyarrow. Am I correct in understanding that currently I'll need to convert pyarrow Tables to pandas DataFrames in order to use most of the pandas features? By "pandas features" I mean everyday slicing and dicing of data: merge, filtering, melt, spread, etc.

I have a data set which starts out as small files (< 1 GB) and quickly explodes into dozens of gigabytes of memory as a pandas DataFrame. I'm interested in whether Arrow can provide a better, optimized data frame.

Thanks.