Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
On 9 April 2024 at 18:45, Jose Manuel Abuin Mosquera wrote:
| If possible, I would like to contribute. At work we use the Go and
| Python implementations, also, in the short term, we will start using the
| Rust one.

Similar for us, and we have seen plenty of build headaches across PyPI or conda ... (Hence my earlier hint about nanoarrow. No linking, uses the C API of two void pointers.)

| Just to point out, the Rust version has its own native implementation,
| here: https://github.com/apache/arrow-rs .

And IIRC there is an independent Arrow implementation (in Rust) used by polars, making it two possible ITPs: vanilla Arrow from Apache and Arrow from polars.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
On 25/03/24 at 19:17, Julian Gilbey wrote:
> Hi all,

Hi :)

> [NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
> set to d-science and the RFP bug only]
>
> An update on Apache Arrow, and in particular the Python library
> PyArrow. For those who don't know:
>
> Apache Arrow is a development platform for in-memory analytics. It
> contains a set of technologies that enable big data systems to
> process and move data fast. It specifies a standardized
> language-independent columnar memory format for flat and
> hierarchical data, organized for efficient analytic operations on
> modern hardware.
>
> The project is developing a multi-language collection of libraries
> for solving systems problems related to in-memory analytical data
> processing. This includes such topics as:
>
> * Zero-copy shared memory and RPC-based data movement
> * Reading and writing file formats (like CSV, Apache ORC, and
>   Apache Parquet)
> * In-memory analytics and query processing
>
> (from: https://arrow.apache.org/docs/index.html)
>
> Pandas has announced that Pandas 3.x will depend on PyArrow in a
> critical way (it will back the "string" datatype), and it is due to
> be released imminently.
>
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this! There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now. As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it. (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages. I don't
> know.) Unfortunately I don't have the capacity to devote any time to
> it myself.
>
> Thanks in advance for anyone who can step forward for this!
>
> Best wishes,
>
> Julian

If possible, I would like to contribute. At work we use the Go and Python implementations, and in the short term we will also start using the Rust one.

Just to point out, the Rust version has its own native implementation, here: https://github.com/apache/arrow-rs .

Cheers,
Jose

-- 
José Manuel Abuín Mosquera
PhD. | Scientific Software Developer | Researcher
http://jmabuin.github.io
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
On 3/25/24 19:17, Julian Gilbey wrote: Hi all, [NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to set to d-science and the RFP bug only] An update on Apache Arrow, and in particular the Python library PyArrow. For those who don't know: Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as: * Zero-copy shared memory and RPC-based data movement * Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet) * In-memory analytics and query processing (from: https://arrow.apache.org/docs/index.html) Pandas has announced that Pandas 3.x will depend on PyArrow in a critical way (it will back the "string" datatype), and it is due to be released imminently. So this is a plea for anyone looking for something really helpful to do: it would be great to have a group of developers finally package this! There was some initial work done (see the RFP bug report for details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021), but that is fairly old now. As Apache Arrow supports numerous languages, it may well benefit from having a group of developers with different areas of expertise to build it. (Or perhaps it would make more sense to split the upstream source into a collection of different Debian source packages for the different supported languages. I don't know.) Unfortunately I don't have the capacity to devote any time to it myself. Thanks in advance for anyone who can step forward for this! 
Best wishes, Julian

Hi,

I may not have much time available to help, though I'd love to have Arrow in Debian, as Ceph uses it and currently relies on an embedded copy.

Cheers,

Thomas Goirand (zigo)
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
On 3/25/24 7:17 PM, Julian Gilbey wrote:
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this! There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now. As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it. (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages. I don't
> know.) Unfortunately I don't have the capacity to devote any time to
> it myself.
>
> Thanks in advance for anyone who can step forward for this!

As someone from the Debian-GIS community, I would also be very interested in this!

The Apache Arrow C++ library is one of the dependencies GDAL/OGR needs in order to read and write (geo)parquet files, a data format with a lot of traction in the geo community [0]. Packaging it would therefore make it possible for QGIS to handle those files (on Debian).

[0] https://cloudnativegeo.org/blog/2023/09/duckdb-the-indispensable-geospatial-tool-you-didnt-know-you-were-missing/

Regards,

Richard Duivenvoorde
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Julian,

Arrow is a complicated and large package. We use it at work (where there is a fair amount of Python, also with Conda etc.) and do have issues with more complex builds, especially because it is 'data infrastructure' and can come in from different parts. I would recommend against packaging an old one -- we have also seen issues with different (py)arrow versions biting.

Have you seen https://github.com/apache/arrow-nanoarrow ? It works via the C API to Arrow, which interchanges data via two void* to the two structs for the Arrow array and schema -- and avoids linkage issues. (In user space the pyarrow or R arrow packages can still be used, also interfacing via these.)

I have been using it for R package bindings for some time and we plan to expand that (again, at work) -- as do others. It is already used by duckdb, by the Arrow 'ADBC' interfaces (which are generic in the ODBC/JDBC sense, but for Arrow), and also by a Python interface to Snowflake.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
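
[Editorial sketch] The "two void* to two structs" exchange mentioned above can be illustrated without linking against any Arrow library. The following is only a minimal sketch using Python's standard ctypes module; it is not nanoarrow's or pyarrow's API. The struct layouts follow the published Arrow C Data Interface, while the helper name read_int32 and the sample values are hypothetical and exist only for this illustration.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


class ArrowArray(ctypes.Structure):
    pass


# Release callbacks take a pointer to the struct they belong to.
SchemaRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Field order mirrors the Arrow C Data Interface structs.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", SchemaRelease),
    ("private_data", ctypes.c_void_p),
]
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease),
    ("private_data", ctypes.c_void_p),
]


@SchemaRelease
def release_schema(ptr):
    # A NULL release pointer marks the struct as released.
    ptr.contents.release = SchemaRelease()


@ArrayRelease
def release_array(ptr):
    ptr.contents.release = ArrayRelease()


def read_int32(schema_addr: int, array_addr: int) -> list[int]:
    """Hypothetical consumer: receives only two raw addresses."""
    schema = ArrowSchema.from_address(schema_addr)
    array = ArrowArray.from_address(array_addr)
    if not schema.release or not array.release:
        raise ValueError("structure was already released")
    if schema.format != b"i":  # "i" is the format string for int32
        raise TypeError("this sketch only handles int32 arrays")
    if array.null_count != 0:
        raise ValueError("this sketch does not handle nulls")
    end = array.offset + array.length
    # buffers[0] is the validity bitmap, buffers[1] holds the values.
    data = (ctypes.c_int32 * end).from_address(array.buffers[1])
    result = list(data[array.offset:end])
    # The consumer signals that it is done with both structs.
    array.release(ctypes.byref(array))
    schema.release(ctypes.byref(schema))
    return result


# Producer side: an int32 array [1, 2, 3] without nulls (sample data).
values = (ctypes.c_int32 * 3)(1, 2, 3)
buffer_ptrs = (ctypes.c_void_p * 2)(None, ctypes.addressof(values))
schema = ArrowSchema(format=b"i", name=b"x", release=release_schema)
array = ArrowArray(
    length=3,
    null_count=0,
    offset=0,
    n_buffers=2,
    n_children=0,
    buffers=ctypes.cast(buffer_ptrs, ctypes.POINTER(ctypes.c_void_p)),
    release=release_array,
)

result = read_int32(ctypes.addressof(schema), ctypes.addressof(array))
print(result)  # [1, 2, 3]
```

Because producer and consumer agree only on the struct layout, neither needs to link against the other's Arrow build, which is the linkage-avoidance point made above. A real producer would also free its buffers in the release callbacks; this sketch leaves that to Python's garbage collector.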
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Hi Diane, On Sat, Mar 30, 2024 at 08:59:39PM -0700, Diane Trout wrote: > Hi Julian, > > On Sat, 2024-03-30 at 20:22 +, Julian Gilbey wrote: > > Lovely to hear from you, and oh wow, that's amazing, thank you! > > > > I can't speak for anyone else, but I suggest that pushing your > > updates > > to the science-team package would be very sensible; it would be silly > > for someone else to have to redo your work. > > > > What more is needed for it to be ready for unstable? > > > The things I think are kind of broken are: > > We've got 7.0.0 and upstreams current version is 15.0.2. Yes, that does seem a little less than ideal! > the pyarrow 7.0.0 tests fail because it depends on a python test > library that breaks with pytest 8.0. Either I need to disable the > python tests or upgrade to a newer version. It may well be that newer versions would work with pytest 8.x. I don't think it's worth spending time trying to patch such a relatively old version. > My upgrade didn't go smoothly because uscan found also upstreams debian > watch file which is too loose and matches some other tar balls on their > distribution site. > > (Though I don't know why uscan keeps looking for watch files after > finding one in debian/watch) Oh dear. uscan(1) does say: Unless --watchfile is given, uscan looks recursively for valid source trees starting from the current directory (see the below section "Directory name checking" for details). and then: For each valid source tree found, typically the following happens: [...] so yes, it will look at more than one location. > And you were probably right in that arrow needs to be a team, because I > have no idea how to get other the other languages interfaces packaged. 
Unless someone else volunteers to do those other language interfaces (perhaps it's not a pressing need for people working with language X), I wonder whether it's worth just packaging the Python (and presumably C++) interfaces for now. If others want to join the effort to support language X later on, a new version of the Debian package can then be uploaded with a new binary package for language X. It does mean more trips through the NEW queue if and when that happens, but given that no-one's shown interest in language X for the last several years, this is unlikely to be much of an issue.

Version 7.0 provided support (it seems) for: GLib (it seems that a draft framework for building this is already in the Debian package, and it can then be used in lots of languages), C++ (this is the core libraries), C# (not of interest to us), Go, Java, JavaScript, Julia, Matlab (not of interest to us), Python, R, Ruby.

> Oh and I probably need to get the pyarrow installed somewhere, since it
> was stopping at the tests I hadn't run into dh_missing errors yet.

Oh. Would pybuild do that automatically (perhaps specifying PYBUILD_PACKAGE)?

Best wishes,

Julian
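
[Editorial sketch] The "too loose" watch-file matching discussed in this message can be illustrated with plain regular expressions. This is only a sketch of the general problem, not the contents of either watch file: the tarball names and both patterns below are hypothetical examples chosen for illustration.

```python
import re

# Hypothetical tarball names as they might appear on a distribution site.
CANDIDATES = [
    "apache-arrow-15.0.2.tar.gz",
    "apache-arrow-adbc-0.10.0.tar.gz",
    "apache-arrow-nanoarrow-0.4.0.tar.gz",
]

# Loose: anything after the prefix counts as a "version".
LOOSE = re.compile(r"apache-arrow-(.+)\.tar\.gz")
# Stricter: the version must consist of dot-separated numbers only.
STRICT = re.compile(r"apache-arrow-(\d+(?:\.\d+)*)\.tar\.gz")


def matched_versions(pattern, names):
    """Return the captured 'version' of every name the pattern matches."""
    return [m.group(1) for n in names if (m := pattern.fullmatch(n))]


print(matched_versions(LOOSE, CANDIDATES))
# ['15.0.2', 'adbc-0.10.0', 'nanoarrow-0.4.0']
print(matched_versions(STRICT, CANDIDATES))
# ['15.0.2']
```

Constraining the version group to digits and dots is one way to keep unrelated tarballs out of the candidate list; the real fix depends on what the upstream distribution site actually contains.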
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Hi Julian,

On Sat, 2024-03-30 at 20:22 +, Julian Gilbey wrote:
> Lovely to hear from you, and oh wow, that's amazing, thank you!
>
> I can't speak for anyone else, but I suggest that pushing your
> updates to the science-team package would be very sensible; it would
> be silly for someone else to have to redo your work.
>
> What more is needed for it to be ready for unstable?

The things I think are kind of broken are:

We've got 7.0.0 and upstream's current version is 15.0.2.

The pyarrow 7.0.0 tests fail because they depend on a Python test library that breaks with pytest 8.0. Either I need to disable the Python tests or upgrade to a newer version.

My upgrade didn't go smoothly because uscan also found upstream's Debian watch file, which is too loose and matches some other tarballs on their distribution site.

(Though I don't know why uscan keeps looking for watch files after finding one in debian/watch.)

And you were probably right that arrow needs to be a team effort, because I have no idea how to get the other language interfaces packaged.

Oh, and I probably need to get pyarrow installed somewhere; since the build was stopping at the tests, I hadn't run into dh_missing errors yet.

Diane
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Hi Diane, On Fri, Mar 29, 2024 at 11:49:07AM -0700, Diane Trout wrote: > On Mon, 2024-03-25 at 18:17 +, Julian Gilbey wrote: > > > > > > So this is a plea for anyone looking for something really helpful to > > do: it would be great to have a group of developers finally package > > this! There was some initial work done (see the RFP bug report for > > details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021), > > but that is fairly old now. As Apache Arrow supports numerous > > languages, it may well benefit from having a group of developers with > > different areas of expertise to build it. (Or perhaps it would make > > more sense to split the upstream source into a collection of > > different > > Debian source packages for the different supported languages. I > > don't > > know.) Unfortunately I don't have the capacity to devote any time to > > it myself. > > > > Thanks in advance for anyone who can step forward for this! > > I've been maintain dask and anndata and saw that apache arrow was > getting increasingly popular. > > I took the current science-team preliminary packaging 7.0.0 packaging > and managed to get it to build through a combination of patches and > turning off features. > > I even mostly managed to get pyarrow to build. (Though some tests fail > due to pytest lazy-fixture being abandoned). > > I pushed my current work in progress to. > > https://salsa.debian.org/diane/arrow.git > > Was anyone else planning on working on it or should I push my updates > to the science-team package? Lovely to hear from you, and oh wow, that's amazing, thank you! I can't speak for anyone else, but I suggest that pushing your updates to the science-team package would be very sensible; it would be silly for someone else to have to redo your work. What more is needed for it to be ready for unstable? Best wishes, Julian
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Hi,

On 25.03.24 at 19:17, Julian Gilbey wrote:
> * Reading and writing file formats (like CSV, Apache ORC, and
>   Apache Parquet)

liborcus supports this (Apache Parquet) if built with Apache Arrow, and thus makes LibreOffice able to handle it. I didn't invest any time in Apache Arrow since I am already too short on time and I deemed it too much of a "low popularity" thing anyway.

> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!

Indeed.

> There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now. As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it. (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages. I don't
> know.)

Would definitely make transitions easier.

> Unfortunately I don't have the capacity to devote any time to
> it myself.

Ditto.

Regards,

Rene
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
On Mon, 2024-03-25 at 18:17 +, Julian Gilbey wrote:
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this! There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now. As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it. (Or perhaps it would make
> more sense to split the upstream source into a collection of
> different Debian source packages for the different supported
> languages. I don't know.) Unfortunately I don't have the capacity to
> devote any time to it myself.
>
> Thanks in advance for anyone who can step forward for this!

I've been maintaining dask and anndata and saw that Apache Arrow was getting increasingly popular.

I took the current science-team preliminary 7.0.0 packaging and managed to get it to build through a combination of patches and turning off features.

I even mostly managed to get pyarrow to build. (Though some tests fail due to pytest lazy-fixture being abandoned.)

I pushed my current work in progress to:

https://salsa.debian.org/diane/arrow.git

Was anyone else planning on working on it, or should I push my updates to the science-team package?

Diane
Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)
Hi all, [NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to set to d-science and the RFP bug only] An update on Apache Arrow, and in particular the Python library PyArrow. For those who don't know: Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as: * Zero-copy shared memory and RPC-based data movement * Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet) * In-memory analytics and query processing (from: https://arrow.apache.org/docs/index.html) Pandas has announced that Pandas 3.x will depend on PyArrow in a critical way (it will back the "string" datatype), and it is due to be released imminently. So this is a plea for anyone looking for something really helpful to do: it would be great to have a group of developers finally package this! There was some initial work done (see the RFP bug report for details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021), but that is fairly old now. As Apache Arrow supports numerous languages, it may well benefit from having a group of developers with different areas of expertise to build it. (Or perhaps it would make more sense to split the upstream source into a collection of different Debian source packages for the different supported languages. I don't know.) Unfortunately I don't have the capacity to devote any time to it myself. Thanks in advance for anyone who can step forward for this! Best wishes, Julian