Hi Vaishal,

You can certainly use NumPy arrays to create Parquet files, but you
will have to do a bit of work to adapt the NumPy arrays to Parquet's
(and Arrow's) columnar data model. A pandas DataFrame is itself backed
by NumPy arrays internally.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Wrap a 1-D NumPy array in an Arrow array, via pandas's memory model
arr = np.random.randn(100)
a0 = pa.Array.from_pandas(arr)

# Assemble a one-column Arrow table from the Arrow array
t = pa.Table.from_arrays([a0], ['col0'])

# Write the table to Parquet and read it back
pq.write_table(t, 'test.parquet')
returned_t = pq.read_table('test.parquet')

The name Array.from_pandas may be misleading -- it accepts any Series
or 1-dimensional ndarray and interprets it according to pandas's
NumPy-based memory model (e.g. it will convert arrays of Python
objects to the appropriate Arrow types). We intend to make the
pyarrow.array function a better entry point for vanilla NumPy data
that did not originate in pandas.
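As a sketch, such an entry point could look like the following -- note
that the exact pyarrow.array behavior shown here is the intended design
rather than a guarantee about any particular release:

import numpy as np
import pyarrow as pa

# Intended usage: hand pyarrow.array a plain ndarray directly,
# with the Arrow type inferred from the NumPy dtype
arr = np.array([1.0, 2.0, 3.0])
a0 = pa.array(arr)  # a float64 ndarray should yield an Arrow double array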

Patches to improve the API / user experience for standalone NumPy
users would be a great way to contribute to the project(s). See
ARROW-564, ARROW-838, ARROW-488. It would be very useful to be able to
construct Arrow arrays from a NumPy array plus a boolean mask for
nulls, for example.
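For illustration, construction from values plus a null mask might look
something like this -- the mask argument name is an assumption about
how the API could be shaped, not a description of a shipped interface:

import numpy as np
import pyarrow as pa

values = np.array([1.5, -2.0, 3.25])
mask = np.array([False, True, False])  # True marks a slot as null

# Hypothetical shape of the proposed API: values + boolean null mask
a0 = pa.array(values, mask=mask)
</imports>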

Thanks
Wes

On Thu, Jun 1, 2017 at 4:34 AM, Shah, Vaishal <vaishal.s...@deshaw.com> wrote:
> This is Vaishal from D. E. Shaw and Co.
>
> We are interested in using py-arrow/parquet for one of our projects, which 
> deals with numpy arrays.
> Parquet provides an API to store pandas dataframes on disk, but I could not 
> find any support for storing numpy arrays.
> Since numpy is such a basic way to store data, I was surprised to find no 
> function to store numpy arrays in parquet format. Is there any way to do 
> this that I may have missed?
> Or can we expect this support in a newer version of parquet?
>
> Thanks,
> Vaishal
>