Hi James,

You can try writing with another format, e.g., Parquet, to see whether it is an 
ORC-specific issue or a more generic one.
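
For example, something along these lines (a minimal sketch; result is the 
DataFrame from your message below, and the Parquet output path is just an 
illustration):

    # Same DataFrame, different output format -- if this also hangs,
    # the problem is probably not ORC-specific.
    result.write.parquet('/data/staged/raw_result_parquet')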

Thanks.

Zhan Zhang

On Feb 23, 2016, at 6:05 AM, James Barney <jamesbarne...@gmail.com> wrote:

I'm trying to write an ORC file after running the FPGrowth algorithm on a 
dataset of only about 2 GB. The algorithm runs fine, and I can display results 
if I convert the freqItemsets() of the model to a DataFrame and take(n) it.
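
Roughly, the relevant part of the job looks like this (a simplified sketch with 
toy data and made-up names, not the actual code):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-orc-test")
    sqlContext = HiveContext(sc)

    # Toy stand-in for the real transactions pulled from Hive
    transactions = sc.parallelize([["a", "b"], ["a", "c"], ["a", "b", "c"]])

    model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=4)

    # freqItemsets() returns an RDD of FreqItemset(items, freq);
    # convert it to a DataFrame so it can be displayed and written out
    result = sqlContext.createDataFrame(
        model.freqItemsets().map(lambda fi: (fi.items, fi.freq)),
        ["items", "freq"])

    print(result.take(5))  # this works fine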

I'm using Spark 1.5.2 on HDP 2.3.4 and Python 3.4.2 on Yarn.

I get the input data by querying a Hive table (also in ORC format) and running 
a number of maps, joins, and filters on it.

When the program attempts to write the files:
    result.write.orc('/data/staged/raw_result')
    size_1_buckets.write.orc('/data/staged/size_1_results')
    filter_size_2_buckets.write.orc('/data/staged/size_2_results')

The first path, /data/staged/raw_result, is created with a _temporary folder, 
but the data is never written. The job hangs at this point, apparently 
indefinitely.

Additionally, no logs are recorded or available for the jobs on the history 
server.

What could be the problem?
