jonbelanger-ns edited a comment on issue #20826: [SPARK-2489][SQL] Support Parquet's optional fixed_len_byte_array
URL: https://github.com/apache/spark/pull/20826#issuecomment-575728511

If it helps, I have a fairly complex Parquet file with a few nested fields stored as FIXED_LEN_BYTE_ARRAY, so this bug is a showstopper for Spark on this dataset.

I tried to fix it by cloning the repo with this PR (https://github.com/aws-awinstan/spark.git) to my local machine and compiling it. I did the same for the Spark master repo, which worked fine with a few of the columns (to test without parsing the FIXED_LEN_BYTE_ARRAY columns). However, the aws-awinstan build fails on the same test columns:

    [Stage 0:> (0 + 1) / 1]20/01/17 12:37:13 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.42.107, executor 0): java.io.StreamCorruptedException: invalid stream header: 0000000F
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:866)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
        at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
        at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:126)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:113)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I'm using the following in my client environment, with HDFS and Spark running remotely in a VM, in standalone mode with a single worker:

    $ pip freeze | grep spark
    pyspark==2.4.4
    spark==0.2.1

I'm surprised this bug has been allowed to languish for as long as it has. It's not possible for us to serialize the upstream data, so we either need this feature or have to move on...
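
As a note on reading the trace above: Java's ObjectInputStream expects a serialized stream to begin with the magic number 0xACED followed by the version 0x0005, and the reported header 0000000F does not match that. The following is only a minimal sketch of that header check in Python, not Spark's or the JDK's code; the function name `has_java_stream_header` and the constant names are hypothetical, chosen for illustration. It shows why the bytes were rejected, not why the executor received them.

```python
import struct

# Values defined by the Java Object Serialization stream protocol.
JAVA_STREAM_MAGIC = 0xACED
JAVA_STREAM_VERSION = 0x0005


def has_java_stream_header(data: bytes) -> bool:
    """Return True if data starts with the Java serialization header.

    Hypothetical helper for illustration only; it mirrors the kind of
    check that raises "invalid stream header" in ObjectInputStream.
    """
    if len(data) < 4:
        return False
    # Two big-endian unsigned 16-bit values: magic, then version.
    magic, version = struct.unpack(">HH", data[:4])
    return magic == JAVA_STREAM_MAGIC and version == JAVA_STREAM_VERSION


# The four bytes reported in the exception message above.
reported_header = bytes.fromhex("0000000F")
# What a plain Java-serialized stream starts with.
expected_header = bytes.fromhex("ACED0005")

print("reported header accepted:", has_java_stream_header(reported_header))
print("expected header accepted:", has_java_stream_header(expected_header))
```

So the task payload reaching the executor did not start as a plain Java-serialized stream. The trace alone does not say which side produced the unexpected bytes.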
