See thread on general@incubator https://lists.apache.org/thread.html/r3108dd293240967cab4d75a8003895b247b3b3b726a7e1e54f3d9b65%40%3Cgeneral.incubator.apache.org%3E
On Tue, May 4, 2021 at 9:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I admit it's an unusual situation to have a single-author codebase
> where the developer is on the PMC. Let's determine what the
> protocol is for this kind of thing in the future so we don't create
> unnecessary work for ourselves.
>
> On Tue, May 4, 2021 at 9:15 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > I apologize. For some reason, I had thought that because Jorge was the only
> > contributor (except for one contribution fixing a typo in the README)
> > the IP clearance process did not apply in this case.
> >
> > I will create a PR to revert.
> >
> > On Tue, May 4, 2021 at 8:06 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Just to circle back on this. Since this was an independent codebase
> > > previously developed over a 10 month period, I had assumed we would be
> > > looking at an IP clearance vote, but instead it was just merged into
> > > arrow-datafusion.
> > >
> > > On Tue, Apr 27, 2021 at 10:50 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > > >
> > > > Hi Jorge,
> > > > This all sounds good to me. It might be nice to test against both the
> > > > pinned released version of pyarrow and at head if possible.
> > > >
> > > > I like the idea of not causing release churn as long as all the
> > > > underlying libraries are compatible.
> > > >
> > > > Thanks for the write up.
> > > >
> > > > -Micah
> > > >
> > > > On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Hi Micah,
> > > > >
> > > > > All testing is actually done from Python: create a record batch in
> > > > > pyarrow, push it to datafusion, consume it back in Python, and
> > > > > compare the result using pyarrow's equality. Sometimes parquet is
> > > > > used instead.
> > > > >
> > > > > The library is tested against pyarrow==1 from PyPI: we can bump that,
> > > > > but if it works in pyarrow==1, chances are things will improve with
> > > > > higher versions :)
> > > > >
> > > > > Releases: I thought to have it released as a separate wheel for two
> > > > > reasons:
> > > > >
> > > > > * not force people that want pyarrow to download datafusion binaries
> > > > >   with it
> > > > > * have independent versioning from pyarrow
> > > > >
> > > > > and "bracket" the pyarrow versions that we ensure compatibility with.
> > > > >
> > > > > Another alternative is to release with the same versioning as
> > > > > datafusion, like arrow c++ / pyarrow and spark / pyspark.
> > > > > The upside is that the versions are aligned. The downside is that we
> > > > > will be releasing a lot of majors for no reason: so far, all backward
> > > > > incompatible changes in datafusion were not backward incompatible in
> > > > > python-datafusion: it is easier to break backward compat. in a Rust
> > > > > library than it is in a Python wrapper to a Rust library.
> > > > >
> > > > > What are your thoughts, Micah?
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > > > > On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <
> > > > > emkornfi...@gmail.com> wrote:
> > > > >
> > > > >> Hi Jorge,
> > > > >> I think this would certainly be a valuable contribution. How were
> > > > >> you thinking of hosting (which repo)/publishing it (maintaining a
> > > > >> separate wheel)? Also did you have thoughts on integration testing
> > > > >> with pyarrow?
> > > > >>
> > > > >> Cheers,
> > > > >> Micah
> > > > >>
> > > > >> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> > > > >> jorgecarlei...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > I filed a PR [1] to open up a discussion to incorporate
> > > > >> > python-datafusion [2] into the Apache Arrow project.
> > > > >> >
> > > > >> > Python-datafusion is a Python library [3] built on top of
> > > > >> > DataFusion that enables people to use DataFusion from Python. It
> > > > >> > leverages the C data interface for zero-cost copy between
> > > > >> > DataFusion and pyarrow (a bunch of pointers is shared around).
> > > > >> >
> > > > >> > For example, it allows users to read a CSV from Rust, pass the
> > > > >> > arrays to a C++ kernel, continue the computation in Rust's
> > > > >> > kernels, and export to parquet using Rust (or C++ parquet, or
> > > > >> > whatever ^_^). It supports UDFs and UDAFs, in case someone wants
> > > > >> > to go crazy with Pyarrow, Pandas, numpy or tensorflow. =)
> > > > >> >
> > > > >> > Best,
> > > > >> > Jorge
> > > > >> >
> > > > >> > [1] https://github.com/apache/arrow-datafusion/pull/69
> > > > >> > [2] https://github.com/jorgecarleitao/datafusion-python
> > > > >> > [3] https://pypi.org/project/datafusion/
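
For readers following the versioning discussion quoted above: the idea of "bracketing" the pyarrow versions that a separately released wheel stays compatible with can be sketched roughly as below. This is only an illustration under stated assumptions, not code from python-datafusion. The names `parse_version` and `in_bracket` and the example bounds are hypothetical, and only plain dotted numeric versions are handled.

```python
from typing import Tuple


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as "1.0.1" into a tuple of ints.

    Short versions are zero-padded to `width` components so that "1" and
    "1.0.0" compare as equal. Pre-release tags are out of scope here.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def in_bracket(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper (a half-open bracket)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Hypothetical bracket: tested from pyarrow 1 up to, but excluding, 4.
SUPPORTED_LOWER = "1"
SUPPORTED_UPPER = "4"
```

With these hypothetical bounds, `in_bracket("1.0.1", SUPPORTED_LOWER, SUPPORTED_UPPER)` is true and `in_bracket("4.0.0", SUPPORTED_LOWER, SUPPORTED_UPPER)` is false. A wrapper released this way could widen the bracket without a major version bump of its own, which is the release-churn point made in the thread.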