Hi Xudong,

This is probably because Parquet schema merging is turned on by default. Schema merging is generally useful for Parquet files with different but compatible schemas, but it requires reading metadata from all Parquet part-files. This can be problematic when reading Parquet tables with lots of part-files, especially when the user doesn't need schema merging.

This issue is tracked by SPARK-6575, and here is a PR for it: https://github.com/apache/spark/pull/5231. This PR adds a configuration to disable schema merging by default when doing Hive metastore Parquet table conversion.
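If you try out that patch, here is a rough spark-shell sketch of how the new option could be used. Note that the option name spark.sql.hive.convertMetastoreParquet.mergeSchema and the table name are only my assumptions here, please check the merged patch for the final key:

    // Assumed option name; verify against the merged PR.
    // Turning merging off avoids reading the footer of every part-file
    // when a Hive metastore Parquet table is converted.
    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)  // sc is the SparkContext provided by spark-shell
    hc.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
    hc.sql("SELECT COUNT(*) FROM my_parquet_table").collect()  // hypothetical table name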

Another workaround is to fall back to the old Parquet code path by setting spark.sql.parquet.useDataSourceApi to false.
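For example, from spark-shell (a minimal sketch; the table name is just a placeholder):

    // Fall back to the old Parquet code path instead of the new data source API.
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    // Then re-run the query against the Hive metastore Parquet table.
    sqlContext.sql("SELECT COUNT(*) FROM my_parquet_table").collect()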

Cheng

On 3/31/15 2:47 PM, Zheng, Xudong wrote:
Hi all,

We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we find that a simple COUNT(*) query is much slower (100x) than on Spark 1.2.

I find that most of the time is spent on the driver getting HDFS block locations. A large number of log entries like the ones below are printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
   fileLength=77153436
   underConstruction=false
   blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
   lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
   isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010

Comparing the printed logs with Spark 1.2: although the number of getBlockLocations calls is similar, each such call took only 20~30 ms there (versus 2000~3000 ms now), and Spark 1.2 didn't print the detailed LocatedBlocks info.

Another finding is that if I read the Parquet file via Scala code from spark-shell as below, it looks fine, and the computation returns the result as quickly as before.

    sqlContext.parquetFile("data/myparquettable")

Any idea about it? Thank you!


--
郑旭东
Zheng, Xudong

