Github user ayoub-benali commented on the pull request:

    https://github.com/apache/spark/pull/4697#issuecomment-75258468
  
    I just tried to reproduce the example from SPARK-5775 in the spark-shell, and now it hangs forever at query time.
    Maybe that's because the tests don't reproduce the same case as in the issue: an array of structs.
    
    ```scala
    scala> hiveContext.sql("select data.field1 from test_table LATERAL VIEW 
explode(data_array) nestedStuff AS data").collect
    15/02/20 16:32:55 INFO ParseDriver: Parsing command: select data.field1 
from test_table LATERAL VIEW explode(data_array) nestedStuff AS data
    15/02/20 16:32:55 INFO ParseDriver: Parse Completed
    15/02/20 16:32:55 INFO MemoryStore: ensureFreeSpace(260309) called with 
curMem=97368, maxMem=280248975
    15/02/20 16:32:55 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 254.2 KB, free 266.9 MB)
    15/02/20 16:32:55 INFO MemoryStore: ensureFreeSpace(28517) called with 
curMem=357677, maxMem=280248975
    15/02/20 16:32:55 INFO MemoryStore: Block broadcast_2_piece0 stored as 
bytes in memory (estimated size 27.8 KB, free 266.9 MB)
    15/02/20 16:32:55 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ****:54658 (size: 27.8 KB, free: 267.2 MB)
    15/02/20 16:32:55 INFO BlockManagerMaster: Updated info of block 
broadcast_2_piece0
    15/02/20 16:32:55 INFO SparkContext: Created broadcast 2 from NewHadoopRDD 
at ParquetTableOperations.scala:119
    15/02/20 16:32:55 INFO FileInputFormat: Total input paths to process : 3
    15/02/20 16:32:55 INFO ParquetInputFormat: Total input paths to process : 3
    15/02/20 16:32:55 INFO ParquetFileReader: Initiating action with 
parallelism: 5
    15/02/20 16:32:55 INFO ParquetFileReader: reading summary file: 
hdfs://****:8020/path/test_table/date=2015-02-12/_metadata
    15/02/20 16:32:55 INFO ParquetFileReader: reading another 1 footers
    15/02/20 16:32:55 INFO ParquetFileReader: Initiating action with 
parallelism: 5
    15/02/20 16:32:55 INFO FilteringParquetRowInputFormat: Fetched 
[LocatedFileStatus{path=hdfs://****:8020/path/test_table/date=2015-02-12/part-r-1.parquet;
 isDirectory=false; length=463; replication=3; blocksize=134217728; 
modification_time=1424446345899; access_time=1424446344501; owner=rptn_deploy; 
group=supergroup; permission=rw-r--r--; isSymlink=false}, 
LocatedFileStatus{path=hdfs://****:8020/path/test_table/date=2015-02-12/part-r-2.parquet;
 isDirectory=false; length=731; replication=3; blocksize=134217728; 
modification_time=1424446346655; access_time=1424446345540; owner=rptn_deploy; 
group=supergroup; permission=rw-r--r--; isSymlink=false}, 
LocatedFileStatus{path=hdfs://****:8020/path/test_table/date=2015-02-12/part-r-3.parquet;
 isDirectory=false; length=727; replication=3; blocksize=134217728; 
modification_time=1424446346773; access_time=1424446345628; owner=rptn_deploy; 
group=supergroup; permission=rw-r--r--; isSymlink=false}] footers in 31 ms
    15/02/20 16:32:55 INFO deprecation: mapred.max.split.size is deprecated. 
Instead, use mapreduce.input.fileinputformat.split.maxsize
    15/02/20 16:32:55 INFO deprecation: mapred.min.split.size is deprecated. 
Instead, use mapreduce.input.fileinputformat.split.minsize
    15/02/20 16:32:55 INFO FilteringParquetRowInputFormat: Using Task Side 
Metadata Split Strategy
    15/02/20 16:32:55 INFO SparkContext: Starting job: collect at 
SparkPlan.scala:84
    15/02/20 16:32:55 INFO DAGScheduler: Got job 2 (collect at 
SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
    15/02/20 16:32:55 INFO DAGScheduler: Final stage: Stage 2(collect at 
SparkPlan.scala:84)
    15/02/20 16:32:55 INFO DAGScheduler: Parents of final stage: List()
    15/02/20 16:32:55 INFO DAGScheduler: Missing parents: List()
    15/02/20 16:32:55 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[26] at 
map at SparkPlan.scala:84), which has no missing parents
    15/02/20 16:32:56 INFO MemoryStore: ensureFreeSpace(7616) called with 
curMem=386194, maxMem=280248975
    15/02/20 16:32:56 INFO MemoryStore: Block broadcast_3 stored as values in 
memory (estimated size 7.4 KB, free 266.9 MB)
    15/02/20 16:32:56 INFO MemoryStore: ensureFreeSpace(4225) called with 
curMem=393810, maxMem=280248975
    15/02/20 16:32:56 INFO MemoryStore: Block broadcast_3_piece0 stored as 
bytes in memory (estimated size 4.1 KB, free 266.9 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
on ****:54658 (size: 4.1 KB, free: 267.2 MB)
    15/02/20 16:32:56 INFO BlockManagerMaster: Updated info of block 
broadcast_3_piece0
    15/02/20 16:32:56 INFO SparkContext: Created broadcast 3 from broadcast at 
DAGScheduler.scala:838
    15/02/20 16:32:56 INFO DAGScheduler: Submitting 3 missing tasks from Stage 
2 (MappedRDD[26] at map at SparkPlan.scala:84)
    15/02/20 16:32:56 INFO TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
    15/02/20 16:32:56 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 
6, ****, NODE_LOCAL, 1639 bytes)
    15/02/20 16:32:56 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 
7, ****, NODE_LOCAL, 1638 bytes)
    15/02/20 16:32:56 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 
8, ****, NODE_LOCAL, 1639 bytes)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
on ****:45208 (size: 4.1 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
on ****:52420 (size: 4.1 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
on ****:43309 (size: 4.1 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ****:43309 (size: 27.8 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ****:52420 (size: 27.8 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ****:45208 (size: 27.8 KB, free: 133.6 MB)
    15/02/20 16:32:56 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 
8) in 490 ms on **** (1/3)
    15/02/20 16:36:01 INFO BlockManager: Removing broadcast 1
    15/02/20 16:36:01 INFO BlockManager: Removing block broadcast_1_piece0
    15/02/20 16:36:01 INFO MemoryStore: Block broadcast_1_piece0 of size 31176 
dropped from memory (free 279882116)
    15/02/20 16:36:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
****:54658 in memory (size: 30.4 KB, free: 267.2 MB)
    15/02/20 16:36:01 INFO BlockManagerMaster: Updated info of block 
broadcast_1_piece0
    15/02/20 16:36:01 INFO BlockManager: Removing block broadcast_1
    15/02/20 16:36:01 INFO MemoryStore: Block broadcast_1 of size 66192 dropped 
from memory (free 279948308)
    15/02/20 16:36:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
****:52420 in memory (size: 30.4 KB, free: 133.6 MB)
    15/02/20 16:36:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
****:45208 in memory (size: 30.4 KB, free: 133.6 MB)
    15/02/20 16:36:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
****:43309 in memory (size: 30.4 KB, free: 133.6 MB)
    15/02/20 16:36:01 INFO ContextCleaner: Cleaned broadcast 1
    
    ```
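
    For anyone trying to reproduce this, a minimal setup along the following lines should exercise the same array-of-structs path. This is only a sketch (Spark 1.2.x-style API; the table and field names mirror the query above but are illustrative), and note that the original issue goes through a partitioned Parquet table, which this in-memory temp table doesn't cover:

    ```scala
    // Hypothetical repro sketch for the array-of-struct case (spark-shell).
    // Names (test_table, data_array, field1) are illustrative.
    case class Field(field1: Int, field2: String)
    case class Record(data_array: Seq[Field])

    import hiveContext.createSchemaRDD

    val rdd = sc.parallelize(Seq(Record(Seq(Field(1, "a"), Field(2, "b")))))
    rdd.registerTempTable("test_table")

    // Same shape as the hanging query: explode the array of structs,
    // then access a struct field of the exploded column.
    hiveContext.sql(
      "select data.field1 from test_table LATERAL VIEW explode(data_array) nestedStuff AS data"
    ).collect()
    ```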

