Hi all,

[NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
set to d-science and the RFP bug only]

An update on Apache Arrow, and in particular the Python library
PyArrow.  For those who don't know:

  Apache Arrow is a development platform for in-memory analytics. It
  contains a set of technologies that enable big data systems to
  process and move data fast. It specifies a standardized
  language-independent columnar memory format for flat and
  hierarchical data, organized for efficient analytic operations on
  modern hardware.

  The project is developing a multi-language collection of libraries
  for solving systems problems related to in-memory analytical data
  processing. This includes such topics as:

  * Zero-copy shared memory and RPC-based data movement

  * Reading and writing file formats (like CSV, Apache ORC, and Apache
    Parquet)

  * In-memory analytics and query processing

  (from: https://arrow.apache.org/docs/index.html)

Pandas has announced that Pandas 3.x will depend on PyArrow
in a critical way (it will back the "string" datatype), and it is due
to be released imminently.

So this is a plea for anyone looking for something really helpful to
do: it would be great to have a group of developers finally package
this!  There was some initial work done (see the RFP bug report for
details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
but that is fairly old now.  As Apache Arrow supports numerous
languages, it may well benefit from having a group of developers with
different areas of expertise to build it.  (Or perhaps it would make
more sense to split the upstream source into a collection of different
Debian source packages for the different supported languages.  I don't
know.)  Unfortunately I don't have the capacity to devote any time to
it myself.

Thanks in advance for anyone who can step forward for this!

Best wishes,

   Julian

Reply via email to