Hello,
I'm building Arrow from source from a fresh checkout; commit:
326015cfc66e1f657cdd6811620137e9e277b43d
Everything seems to build against Python 2.7:
$ python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet --with-plasma --inplace
{...}
Bundling includes: release/include
release/gandiva.so
Cython module gandiva failure permitted
('Moving generated C++ source', 'lib.cpp', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/lib.cpp')
('Moving built C-extension', 'release/lib.so', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/lib.so')
('Moving generated C++ source', '_csv.cpp', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_csv.cpp')
('Moving built C-extension', 'release/_csv.so', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_csv.so')
release/_cuda.so
Cython module _cuda failure permitted
('Moving generated C++ source', '_parquet.cpp', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_parquet.cpp')
('Moving built C-extension', 'release/_parquet.so', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_parquet.so')
release/_orc.so
Cython module _orc failure permitted
('Moving generated C++ source', '_plasma.cpp', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_plasma.cpp')
('Moving built C-extension', 'release/_plasma.so', 'to build path',
'/home/apalumbo/repos/arrow/python/pyarrow/_plasma.so')
{...}
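For context, the environment follows the dev docs here; roughly (ARROW_BUILD_TYPE=release per the output above; the install prefix is illustrative):

$ export ARROW_BUILD_TYPE=release
$ export ARROW_HOME=$HOME/local        # illustrative prefix
$ export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH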
Running the tests, though, I get:
$ py.test pyarrow
ImportError while loading conftest '/home/apalumbo/repos/arrow/python/pyarrow/tests/conftest.py'.
../../pyarrow/lib/python2.7/site-packages/six.py:709: in exec_
    exec("""exec _code_ in _globs_, _locs_""")
pyarrow/tests/conftest.py:20: in <module>
    import hypothesis as h
E   ImportError: No module named hypothesis
After a pip install of `hypothesis` into my (Python 2.7) venv, I am able to run the tests.
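For reference, that was just:

$ pip install hypothesis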
Several fail right off the bat; many of the errors seem to be pandas-related (see the bottom for a partial stack trace).
Switching to a virtualenv running Python 3.5, the build fails:
$ make -j4
{...}
make[2]: *** [src/arrow/python/CMakeFiles/arrow_python_objlib.dir/benchmark.cc.o] Error 1
CMakeFiles/Makefile2:1862: recipe for target 'src/arrow/python/CMakeFiles/arrow_python_objlib.dir/all' failed
make[1]: *** [src/arrow/python/CMakeFiles/arrow_python_objlib.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
-- glog_ep install command succeeded. See also
/home/apalumbo/repos/arrow/cpp/build/glog_ep-prefix/src/glog_ep-stamp/glog_ep-install-*.log
[ 40%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/common.cc.o
[ 40%] Completed 'glog_ep'
[ 40%] Built target glog_ep
[ 41%] Building CXX object
src/plasma/CMakeFiles/plasma_objlib.dir/eviction_policy.cc.o
[ 41%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/events.cc.o
[ 42%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/fling.cc.o
[ 42%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/io.cc.o
[ 43%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/malloc.cc.o
[ 43%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/plasma.cc.o
[ 44%] Building CXX object src/plasma/CMakeFiles/plasma_objlib.dir/protocol.cc.o
[ 44%] Building C object
src/plasma/CMakeFiles/plasma_objlib.dir/thirdparty/ae/ae.c.o
[ 44%] Built target plasma_objlib
-- jemalloc_ep build command succeeded. See also
/home/apalumbo/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-build-*.log
[ 45%] Performing install step for 'jemalloc_ep'
-- jemalloc_ep install command succeeded. See also
/home/apalumbo/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep-stamp/jemalloc_ep-install-*.log
[ 45%] Completed 'jemalloc_ep'
[ 45%] Built target jemalloc_ep
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2
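If it helps, I can re-run the failing target serially with verbose output to capture the full compiler error, e.g.:

$ make -j1 VERBOSE=1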
Any thoughts? I'm building with the instructions from
https://arrow.apache.org/docs/python/development.html#development
Thanks in advance,
Andy
Partial stack trace (Python 2.7):
$ py.test pyarrow
{...}
[5000 rows x 1 columns]
schema = None, preserve_index = False, nthreads = 16, columns = None, safe = True

    def dataframe_to_arrays(df, schema, preserve_index, nthreads=1,
                            columns=None, safe=True):
        names, column_names, index_columns, index_column_names, \
            columns_to_convert, convert_types = _get_columns_to_convert(
                df, schema, preserve_index, columns
            )

        # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
        # using a thread pool is worth it. Currently the heuristic is whether the
        # nrows > 100 * ncols.
        if nthreads is None:
            nrows, ncols = len(df), len(df.columns)
            if nrows > ncols * 100:
                nthreads = pa.cpu_count()
            else:
                nthreads = 1

        def convert_column(col, ty):
            try:
                return pa.array(col, type=ty, from_pandas=True, safe=safe)
            except (pa.ArrowInvalid,
                    pa.ArrowNotImplementedError,
                    pa.ArrowTypeError) as e:
                e.args += ("Conversion failed for column {0!s} with type {1!s}"
                           .format(col.name, col.dtype),)
                raise e

        if nthreads == 1:
            arrays = [convert_column(c, t)
                      for c, t in zip(columns_to_convert,
                                      convert_types)]
        else:
>           from concurrent import futures
E           ImportError: No module named concurrent

pyarrow/pandas_compat.py:430: ImportError
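A guess on this one: concurrent.futures is only in the standard library from Python 3 on, so on 2.7 it presumably needs the `futures` backport from PyPI:

$ pip install futures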
___________________________________________________ test_compress_decompress ___________________________________________________
    def test_compress_decompress():
        INPUT_SIZE = 10000
        test_data = (np.random.randint(0, 255, size=INPUT_SIZE)
                     .astype(np.uint8)
                     .tostring())
        test_buf = pa.py_buffer(test_data)
        codecs = ['lz4', 'snappy', 'gzip', 'zstd', 'brotli']
        for codec in codecs:
>           compressed_buf = pa.compress(test_buf, codec=codec)

pyarrow/tests/test_io.py:508:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/io.pxi:1340: in pyarrow.lib.compress
    check_status(CCodec.Create(c_codec, &compressor))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   raise ArrowNotImplementedError(message)
E   ArrowNotImplementedError: ZSTD codec support not built

pyarrow/error.pxi:89: ArrowNotImplementedError
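My guess on the ZSTD failures (this one and the roundtrip below): the C++ library was built without that codec; presumably rebuilding with it enabled would cover them, something like:

$ cmake -DARROW_WITH_ZSTD=ON ..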
_______________________________________________ test_compressed_roundtrip[zstd] ________________________________________________
compression = 'zstd'
    @pytest.mark.parametrize("compression",
                             ["bz2", "brotli", "gzip", "lz4", "zstd"])
    def test_compressed_roundtrip(compression):
        data = b"some test data\n" * 10 + b"eof\n"
        raw = pa.BufferOutputStream()
        try:
>           with pa.CompressedOutputStream(raw, compression) as compressed:

pyarrow/tests/test_io.py:1045:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/io.pxi:1149: in pyarrow.lib.CompressedOutputStream.__init__
    self._init(stream, compression_type)
pyarrow/io.pxi:1162: in pyarrow.lib.CompressedOutputStream._init
    _make_compressed_output_stream(stream.get_output_stream(),
pyarrow/io.pxi:1087: in pyarrow.lib._make_compressed_output_stream
    check_status(CCodec.Create(compression_type, &codec))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   raise ArrowNotImplementedError(message)
E   ArrowNotImplementedError: ZSTD codec support not built

pyarrow/error.pxi:89: ArrowNotImplementedError
__________________________________________ test_pandas_serialize_round_trip_nthreads ___________________________________________
    def test_pandas_serialize_round_trip_nthreads():
        index = pd.Index([1, 2, 3], name='my_index')
        columns = ['foo', 'bar']
        df = pd.DataFrame(
            {'foo': [1.5, 1.6, 1.7], 'bar': list('abc')},
            index=index, columns=columns
        )
>       _check_serialize_pandas_round_trip(df, use_threads=True)

pyarrow/tests/test_ipc.py:536:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/tests/test_ipc.py:514: in _check_serialize_pandas_round_trip
    buf = pa.serialize_pandas(df, nthreads=2 if use_threads else 1)
pyarrow/ipc.py:163: in serialize_pandas
    preserve_index=preserve_index)
pyarrow/table.pxi:864: in pyarrow.lib.RecordBatch.from_pandas
    names, arrays, metadata = pdcompat.dataframe_to_arrays(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
df =           foo bar
my_index
1          1.5   a
2          1.6   b
3          1.7   c, schema = None
preserve_index = True, nthreads = 2, columns = None, safe = True
    def dataframe_to_arrays(df, schema, preserve_index, nthreads=1,
                            columns=None, safe=True):
        names, column_names, index_columns, index_column_names, \
            columns_to_convert, convert_types = _get_columns_to_convert(
                df, schema, preserve_index, columns
            )

        # NOTE(wesm): If nthreads=None, then we use a heuristic to decide whether
        # using a thread pool is worth it. Currently the heuristic is whether the
        # nrows > 100 * ncols.
        if nthreads is None:
            nrows, ncols = len(df), len(df.columns)
            if nrows > ncols * 100:
                nthreads = pa.cpu_count()
            else:
                nthreads = 1

        def convert_column(col, ty):
            try:
                return pa.array(col, type=ty, from_pandas=True, safe=safe)
            except (pa.ArrowInvalid,
                    pa.ArrowNotImplementedError,
                    pa.ArrowTypeError) as e:
                e.args += ("Conversion failed for column {0!s} with type {1!s}"
                           .format(col.name, col.dtype),)
                raise e

        if nthreads == 1:
            arrays = [convert_column(c, t)
                      for c, t in zip(columns_to_convert,
                                      convert_types)]
        else:
>           from concurrent import futures
E           ImportError: No module named concurrent

pyarrow/pandas_compat.py:430: ImportError
======================================================= warnings summary =======================================================
pyarrow/tests/test_convert_pandas.py::TestConvertMetadata::test_empty_list_metadata
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
pyarrow/tests/test_convert_pandas.py::TestListTypes::test_column_of_lists_first_empty
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
pyarrow/tests/test_convert_pandas.py::TestListTypes::test_empty_list_roundtrip
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
/home/apalumbo/repos/pyarrow/lib/python2.7/site-packages/pandas/core/dtypes/missing.py:431:
DeprecationWarning: The truth value of an empty array is ambiguous. Returning
False, but in future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
if left_value != right_value:
-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================== 45 failed, 997 passed, 194 skipped, 3 xfailed, 7 warnings in 33.14 seconds ==========================
(pyarrow)