jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
URL: https://github.com/apache/spark/pull/5400#issuecomment-462847339

@squito @SparkQA @vanzin @shaneknapp @tgravescs

I am using spark.sql on AWS Glue to generate a single large gzip-compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue:

```
  write(transformed_feed)
  File "script_2019-02-12-15-57-55.py", line 161, in write
    output_path_premium, header=True, compression="gzip")
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 766, in csv
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o210.csv.
...
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-32-189-222.ec2.internal, executor 1): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
```

Below is the Python code used to write the file:

```python
def write(dataframe):
    # write two files, premium and non-premium listings (criteria: listing_priority >= 30 = premium)
    dataframe.filter(dataframe["listing_priority"] >= 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_premium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_premium + '/part-*' + ' ' + output_path_premium + output_file_premium
    os.system(shell_command)

    dataframe.filter(dataframe["listing_priority"] < 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_nonpremium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_nonpremium + '/part-*' + ' ' + output_path_nonpremium + output_file_nonpremium
    os.system(shell_command)
```

I am assuming it is because the file is greater than 2GB. Has this issue been fixed?
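If it has not, I am considering a workaround along these lines for the premium feed: write the output as several gzip part files so no single partition exceeds 2GB, then concatenate the parts back into one file. This is only an untested sketch; the helper name, the partition count, and the paths are placeholders, not anything from Spark itself.

```python
import subprocess


def write_premium_split_then_merge(dataframe, output_path, merged_file, num_parts=16):
    # Write the premium feed as several gzip part files instead of one,
    # so no single partition (and no single block) has to exceed 2GB.
    (dataframe
        .filter(dataframe["listing_priority"] >= 30)
        .drop('listing_priority')
        .drop('image_count')
        .repartition(num_parts)  # pick a count that keeps each part well under 2GB
        .write.csv(output_path, header=False, compression="gzip"))
    # header=False on purpose: with header=True every part file would repeat the
    # header row, and those extra rows would survive the merge below.

    # gzip allows concatenated members, so cat-ing the parts together still yields
    # one valid .csv.gz file -- the single file the client wants.
    merge_cmd = "hdfs dfs -cat " + output_path + "/part-* | hdfs dfs -put - " + merged_file
    subprocess.check_call(merge_cmd, shell=True)
```

The non-premium feed would be handled the same way with the `< 30` filter. If this is a reasonable way around the 2GB block limit, please let me know; otherwise I would still like to know whether the limit itself has been lifted.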
