Matthieu Vanhoutte created SPARK-37882:
------------------------------------------
Summary: pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional
array values
Key: SPARK-37882
URL: https://issues.apache.org/jira/browse/SPARK-37882
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.2.0
Environment: Ubuntu 18.04
Reporter: Matthieu Vanhoutte
Hello,
When trying to convert a pandas dataframe
{code:java}
ss_corpus_dataframe{code}
(containing one column whose values are two-dimensional numpy arrays) into a
pandas-on-spark dataframe with the following code:
{code:java}
df = ps.from_pandas(ss_corpus_dataframe){code}
I get the following error:
{code:java}
Traceback (most recent call last):
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py",
line 375, in run_asgi
result = await app(self.scope, self.receive, self.send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py",
line 75, in __call__
return await self.app(scope, receive, send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/message_logger.py",
line 82, in __call__
raise exc from None
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/uvicorn/middleware/message_logger.py",
line 78, in __call__
await self.app(scope, inner_receive, inner_send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/applications.py",
line 208, in __call__
await super().__call__(scope, receive, send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/applications.py",
line 112, in __call__
await self.middleware_stack(scope, receive, send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/middleware/errors.py",
line 181, in __call__
raise exc
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/middleware/errors.py",
line 159, in __call__
await self.app(scope, receive, _send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/exceptions.py",
line 82, in __call__
raise exc
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/exceptions.py",
line 71, in __call__
await self.app(scope, receive, sender)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py",
line 656, in __call__
await route.handle(scope, receive, send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py",
line 259, in handle
await self.app(scope, receive, send)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/starlette/routing.py",
line 61, in app
response = await func(request)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/routing.py",
line 226, in app
raw_response = await run_endpoint_function(
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/fastapi/routing.py",
line 159, in run_endpoint_function
return await dependant.call(**values)
File "./app/routers/semantic_searches.py", line 60, in create_semantic_search
date_time_sem_search, clean_query, output_dict, error_code = await
apply_semantic_search_async(query=query,
api_sent_embed_url=settings.api_sent_embed_address,
ss_corpus_dataframe=ss_corpus_dataframe.dataframe, id_matrices=id_matrices,
top_k=75, similarity_score_thresh=0.5)
File "./app/backend/semantic_search/sts_tf_semantic_search.py", line 134, in
apply_semantic_search_async
df = ps.from_pandas(ss_corpus_dataframe)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/namespace.py",
line 143, in from_pandas
return DataFrame(pobj)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/frame.py",
line 520, in __init__
internal = InternalFrame.from_pandas(pdf)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/internal.py",
line 1460, in from_pandas
) = InternalFrame.prepare_pandas_frame(pdf)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/internal.py",
line 1533, in prepare_pandas_frame
spark_type = infer_pd_series_spark_type(reset_index[col], dtype)
File
"/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/pyspark/pandas/typedef/typehints.py",
line 329, in infer_pd_series_spark_type
return from_arrow_type(pa.Array.from_pandas(pser).type)
File "pyarrow/array.pxi", line 904, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values{code}
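For reference, a minimal stand-in reproduction (using a hypothetical dataframe in place of my real `ss_corpus_dataframe`, whose cells hold 2-D embedding matrices) and a possible workaround sketch, assuming Arrow can infer a nested list type once each 2-D array is turned into nested Python lists:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ss_corpus_dataframe: one column whose cells
# are 2-D numpy arrays (e.g. per-document embedding matrices).
pdf = pd.DataFrame({
    "embeddings": [np.zeros((3, 4)), np.ones((2, 4))]
})

# ps.from_pandas(pdf) fails at this point: pyspark.pandas infers the Spark
# type via pa.Array.from_pandas, which only accepts 1-dimensional values,
# hence "pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values".

# Workaround sketch (assumption, not verified on Spark 3.2.0): convert each
# 2-D array into nested Python lists, which Arrow can represent as a nested
# list type (list<list<double>>), so type inference no longer rejects it.
pdf["embeddings"] = pdf["embeddings"].map(lambda a: a.tolist())
```

With the nested-list form, I would expect `ps.from_pandas(pdf)` to map the column to `ArrayType(ArrayType(DoubleType()))`, but native support for multi-dimensional numpy values would avoid this conversion step.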
Would it be possible to add support for converting multi-dimensional array
values from pandas to pandas-on-spark?
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]