jkhalid commented on issue #5400: [SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit
URL: https://github.com/apache/spark/pull/5400#issuecomment-462847339

@squito @SparkQA @vanzin @shaneknapp @tgravescs

I am using spark.sql on AWS Glue to generate a single large gzip-compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue:

```
  write(transformed_feed)
  File "script_2019-02-12-15-57-55.py", line 161, in write
    output_path_premium, header=True, compression="gzip")
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 766, in csv
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o210.csv.
...
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-32-189-222.ec2.internal, executor 1): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
```

Below is the Python code used to write the file:

```python
def write(dataframe):
    # write two files, premium and non-premium listings (criteria: listing_priority >= 30 = premium)
    dataframe.filter(dataframe["listing_priority"] >= 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_premium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_premium + '/part-*' + ' ' + output_path_premium + output_file_premium
    os.system(shell_command)

    dataframe.filter(dataframe["listing_priority"] < 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_nonpremium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_nonpremium + '/part-*' + ' ' + output_path_nonpremium + output_file_nonpremium
    os.system(shell_command)
```

I am assuming it is because the file is greater than 2GB. Has this issue been fixed?
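If it has not, I am considering a workaround along these lines for the premium feed: write the output as several gzip part files so no single partition exceeds 2GB, then concatenate the parts back into one file. This is only an untested sketch; the helper name, the partition count, and the paths are placeholders, not anything from Spark itself.

```python
import subprocess


def write_premium_split_then_merge(dataframe, output_path, merged_file, num_parts=16):
    # Write the premium feed as several gzip part files instead of one,
    # so no single partition (and no single block) has to exceed 2GB.
    (dataframe
        .filter(dataframe["listing_priority"] >= 30)
        .drop('listing_priority')
        .drop('image_count')
        .repartition(num_parts)  # pick a count that keeps each part well under 2GB
        .write.csv(output_path, header=False, compression="gzip"))
    # header=False on purpose: with header=True every part file would repeat the
    # header row, and those extra rows would survive the merge below.

    # gzip allows concatenated members, so cat-ing the parts together still yields
    # one valid .csv.gz file -- the single file the client wants.
    merge_cmd = "hdfs dfs -cat " + output_path + "/part-* | hdfs dfs -put - " + merged_file
    subprocess.check_call(merge_cmd, shell=True)
```

The non-premium feed would be handled the same way with the `< 30` filter. If this is a reasonable way around the 2GB block limit, please let me know; otherwise I would still like to know whether the limit itself has been lifted.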
