[ 
https://issues.apache.org/jira/browse/ARROW-15402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narendran Raghavan updated ARROW-15402:
---------------------------------------
          Component/s:     (was: Python)
    Affects Version/s:     (was: 6.0.1)
          Description:     (was: I have a few Parquet files (each at least 
15 GB in size) that I am reading with the dask.dataframe.read_parquet() 
function to process for NLP work. Each of these Parquet files has 5 columns, 
all of which are strings, and 2 of those columns contain large text strings.
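
Roughly, the read looks like this (a minimal sketch; the path and glob below 
are placeholders, not my real file layout):

{code:python}
import dask.dataframe as dd

# Placeholder path; each matching file is 15+ GB with 5 string columns,
# 2 of which hold large free-text values.
ddf = dd.read_parquet("/data/nlp/part-*.parquet")

# Materializing the data is what eventually hits the pyarrow error below.
df = ddf.compute()
{code}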

When I try to process them using my company's internal distributed computing 
library, parabolt (which is just a layer on top of dask), I get a pyarrow 
error (shown below). I'm not sure what this error means, since my Parquet 
files appear to have a straightforward 5-column schema.

Example of the dask dataframe columns read from one of those Parquet files:

!image-2022-01-20-16-11-51-385.png|width=497,height=197!

I need help resolving this.

 

*Error Traceback:*
File "src/summary_stats/main.py", line 92, in <module>
ssp.run()
File "src/summary_stats/main.py", line 60, in run
task_obj["obj"].run(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 198, 
in run
self.pp_data = self.pre_process_data(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 177, 
in pre_process_data
pre_processed_data = 
input_data.vector_process(processing).compute_with_progress()
File "/miniconda/lib/python3.7/site-packages/parabolt/dsl_base.py", line 284, 
in compute_with_progress
initial_wait=initial_wait,
File "/miniconda/lib/python3.7/site-packages/parabolt/tqdm.py", line 160, in 
_observe
return future.result()
File "/miniconda/lib/python3.7/site-packages/distributed/client.py", line 238, 
in result
raise exc.with_traceback(tb)
File "/miniconda/lib/python3.7/site-packages/dask/optimization.py", line 969, 
in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 119, in 
_execute_task
return func(*(_execute_task(a, cache) for a in args))
File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", 
line 94, in __call__
self.common_kwargs,
File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", 
line 423, in read_parquet_part
for (rg, kw) in part
File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", 
line 423, in <listcomp>
for (rg, kw) in part
File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", 
line 434, in read_partition
**kwargs,
File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", 
line 1558, in _read_table
**kwargs,

File 
"/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", 
line 234, in _read_table_from_path
use_pandas_metadata=True,
File "/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py", line 384, in 
read
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1097, in 
pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented 
for chunked array outputs
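
For reference, a pyarrow-only sketch that might isolate the failure outside 
of dask/parabolt (the path is a placeholder; ParquetFile, num_row_groups, and 
read_row_group are standard pyarrow APIs). Reading one row group at a time 
keeps each conversion small, which may sidestep the chunked-output path:

{code:python}
import pyarrow.parquet as pq

# Placeholder path; substitute one of the real 15+ GB files.
pf = pq.ParquetFile("/data/nlp/part-0.parquet")

# Check whether any column is actually a nested type (list/struct),
# since the error mentions "nested data conversions".
print(pf.schema_arrow)

# Read row group by row group instead of the whole file at once.
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)
    # process `table` incrementally here
{code}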
 )
              Summary: a  (was: Running into Error: 
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented 
for chunked array outputs)

> a
> -
>
>                 Key: ARROW-15402
>                 URL: https://issues.apache.org/jira/browse/ARROW-15402
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Narendran Raghavan
>            Priority: Critical
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
