Narendran Raghavan created ARROW-15402:
------------------------------------------

             Summary: Running into Error: pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
                 Key: ARROW-15402
                 URL: https://issues.apache.org/jira/browse/ARROW-15402
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.1
            Reporter: Narendran Raghavan
         Attachments: image-2022-01-20-16-11-51-385.png

I have a few Parquet files (each at least 15 GB) that I am reading with 
dask.dataframe.read_parquet() to process for NLP work. Each of these Parquet 
files has 5 columns, all of which are strings, and 2 of those columns contain 
large text strings.

When I try to process them using my company's internal distributed computing 
library, parabolt (a thin layer on top of Dask), I get the PyArrow error shown 
below. I'm not sure what this error means, since my Parquet files appear to 
contain just 5 flat string columns, with no nested data.

Example of my dask dataframe columns read from 1 of those parquet files:

!image-2022-01-20-16-11-51-385.png|width=497,height=197!

I need help resolving this.

*Error Traceback:*
File "src/summary_stats/main.py", line 92, in <module>
  ssp.run()
File "src/summary_stats/main.py", line 60, in run
  task_obj["obj"].run(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 198, in run
  self.pp_data = self.pre_process_data(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 177, in pre_process_data
  pre_processed_data = input_data.vector_process(processing).compute_with_progress()
File "/miniconda/lib/python3.7/site-packages/parabolt/dsl_base.py", line 284, in compute_with_progress
  initial_wait=initial_wait,
File "/miniconda/lib/python3.7/site-packages/parabolt/tqdm.py", line 160, in _observe
  return future.result()
File "/miniconda/lib/python3.7/site-packages/distributed/client.py", line 238, in result
  raise exc.with_traceback(tb)
File "/miniconda/lib/python3.7/site-packages/dask/optimization.py", line 969, in __call__
  return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 149, in get
  result = _execute_task(task, cache)
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
  return func(*(_execute_task(a, cache) for a in args))
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 94, in __call__
  self.common_kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in read_parquet_part
  for (rg, kw) in part
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in <listcomp>
  for (rg, kw) in part
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 434, in read_partition
  **kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 1558, in _read_table
  **kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 234, in _read_table_from_path
  use_pandas_metadata=True,
File "/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py", line 384, in read
  use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
