Narendran Raghavan created ARROW-15402:
------------------------------------------
Summary: Running into Error: pyarrow.lib.ArrowNotImplementedError:
Nested data conversions not implemented for chunked array outputs
Key: ARROW-15402
URL: https://issues.apache.org/jira/browse/ARROW-15402
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 6.0.1
Reporter: Narendran Raghavan
Attachments: image-2022-01-20-16-11-51-385.png
I have a few Parquet files (each at least 15+ GB in size) that I am reading with the dask.dataframe.read_parquet() function in order to process them for NLP work. Each of these Parquet files has 5 columns, all of which are strings, and 2 of those columns contain large text strings.
When I try to process them using my company's internal distributed computing library, parabolt (a thin layer on top of dask), I get the pyarrow error shown below. I'm not sure what this error means, since my Parquet files appear to be a straightforward 5 string columns.
Example of the dask dataframe columns read from one of these Parquet files:
!image-2022-01-20-16-11-51-385.png|width=497,height=197!
I need help resolving this.
*Error Traceback:*
{code}
  File "src/summary_stats/main.py", line 92, in <module>
    ssp.run()
  File "src/summary_stats/main.py", line 60, in run
    task_obj["obj"].run(input_data)
  File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 198, in run
    self.pp_data = self.pre_process_data(input_data)
  File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 177, in pre_process_data
    pre_processed_data = input_data.vector_process(processing).compute_with_progress()
  File "/miniconda/lib/python3.7/site-packages/parabolt/dsl_base.py", line 284, in compute_with_progress
    initial_wait=initial_wait,
  File "/miniconda/lib/python3.7/site-packages/parabolt/tqdm.py", line 160, in _observe
    return future.result()
  File "/miniconda/lib/python3.7/site-packages/distributed/client.py", line 238, in result
    raise exc.with_traceback(tb)
  File "/miniconda/lib/python3.7/site-packages/dask/optimization.py", line 969, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 94, in __call__
    self.common_kwargs,
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in read_parquet_part
    for (rg, kw) in part
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in <listcomp>
    for (rg, kw) in part
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 434, in read_partition
    **kwargs,
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 1558, in _read_table
    **kwargs,
  File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 234, in _read_table_from_path
    use_pandas_metadata=True,
  File "/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py", line 384, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
{code}
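In case it is useful for triage: a minimal sketch of a possible workaround I am considering, not a confirmed fix. The assumption here is that the error is raised when pyarrow has to materialize a column as a chunked array while reading the whole file at once, and that reading one row group at a time and concatenating sidesteps that path. The file path and the tiny demo table below are hypothetical stand-ins for the real 15+ GB inputs.

{code}
# Hypothetical workaround sketch: read a Parquet file one row group at a
# time instead of in a single read_table()/read_all() call.
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq


def read_by_row_group(path):
    # Open the file and read each row group as its own small Table,
    # then concatenate the pieces into one Table.
    pf = pq.ParquetFile(path)
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    return pa.concat_tables(tables)


# Tiny demo file standing in for the real multi-GB inputs.
path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
table = pa.table({"text": ["a" * 10, "b" * 10], "label": ["x", "y"]})
pq.write_table(table, path, row_group_size=1)  # force 2 row groups

result = read_by_row_group(path)
{code}

If this approach is sound, the equivalent on the dask side might be splitting partitions by row group so each task reads less data at once, but I have not verified that.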
--
This message was sent by Atlassian Jira
(v8.20.1#820001)