[
https://issues.apache.org/jira/browse/ARROW-15402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Narendran Raghavan updated ARROW-15402:
---------------------------------------
Component/s: (was: Python)
Affects Version/s: (was: 6.0.1)
Description: (was: I have a few parquet files (each at least 15 GB in
size) and I am reading them with the dask.dataframe.read_parquet()
function to process them for NLP work. Each of these parquet files has 5
columns, all of which are strings, and 2 of those columns contain large
text strings.
When I try to process them using my company's internal distributed computing
library, parabolt (a thin layer on top of dask), I get a pyarrow error
(shown below). I'm not sure what this error means, since my parquet files
have a straightforward 5 columns.
Example of my dask dataframe columns read from 1 of those parquet files:
!image-2022-01-20-16-11-51-385.png|width=497,height=197!
I need help resolving this.
*Error Traceback:*
File "src/summary_stats/main.py", line 92, in <module>
  ssp.run()
File "src/summary_stats/main.py", line 60, in run
  task_obj["obj"].run(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 198, in run
  self.pp_data = self.pre_process_data(input_data)
File "/miniconda/lib/python3.7/site-packages/summary_stats/stats.py", line 177, in pre_process_data
  pre_processed_data = input_data.vector_process(processing).compute_with_progress()
File "/miniconda/lib/python3.7/site-packages/parabolt/dsl_base.py", line 284, in compute_with_progress
  initial_wait=initial_wait,
File "/miniconda/lib/python3.7/site-packages/parabolt/tqdm.py", line 160, in _observe
  return future.result()
File "/miniconda/lib/python3.7/site-packages/distributed/client.py", line 238, in result
  raise exc.with_traceback(tb)
File "/miniconda/lib/python3.7/site-packages/dask/optimization.py", line 969, in __call__
  return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 149, in get
  result = _execute_task(task, cache)
File "/miniconda/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
  return func(*(_execute_task(a, cache) for a in args))
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 94, in __call__
  self.common_kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in read_parquet_part
  for (rg, kw) in part
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 423, in <listcomp>
  for (rg, kw) in part
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 434, in read_partition
  **kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 1558, in _read_table
  **kwargs,
File "/miniconda/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 234, in _read_table_from_path
  use_pandas_metadata=True,
File "/miniconda/lib/python3.7/site-packages/pyarrow/parquet.py", line 384, in read
  use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1097, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented
for chunked array outputs
)
Summary: a (was: Running into Error:
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented
for chunked array outputs)
> a
> -
>
> Key: ARROW-15402
> URL: https://issues.apache.org/jira/browse/ARROW-15402
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Narendran Raghavan
> Priority: Critical
> Original Estimate: 336h
> Remaining Estimate: 336h
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)