Hi Xudong,

This is probably because Parquet schema merging is turned on by default. Schema merging is generally useful for Parquet files with different but compatible schemas, but it requires reading metadata from all Parquet part-files. This can be problematic when reading Parquet tables with lots of part-files, especially when the user doesn't need schema merging.

This issue is tracked by SPARK-6575, and here is a PR for it: https://github.com/apache/spark/pull/5231. This PR adds a configuration to disable schema merging by default when doing Hive metastore Parquet table conversion.
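If you try out that patch, here is a rough spark-shell sketch of how the new option could be used. Note that the option name spark.sql.hive.convertMetastoreParquet.mergeSchema and the table name are only my assumptions here, please check the merged patch for the final key:

    // Assumed option name; verify against the merged PR.
    // Turning merging off avoids reading the footer of every part-file
    // when a Hive metastore Parquet table is converted.
    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)  // sc is the SparkContext provided by spark-shell
    hc.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
    hc.sql("SELECT COUNT(*) FROM my_parquet_table").collect()  // hypothetical table name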

Another workaround is to fall back to the old Parquet code path by setting spark.sql.parquet.useDataSourceApi to false.
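For example, from spark-shell (a minimal sketch; the table name is just a placeholder):

    // Fall back to the old Parquet code path instead of the new data source API.
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    // Then re-run the query against the Hive metastore Parquet table.
    sqlContext.sql("SELECT COUNT(*) FROM my_parquet_table").collect()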

Cheng

On 3/31/15 2:47 PM, Zheng, Xudong wrote:
Hi all,

We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we find that a simple COUNT(*) query is much slower (100x) than on Spark 1.2.

I find that most of the time is spent on the driver getting HDFS block locations. A large number of log entries like the ones below are printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
   fileLength=77153436
   underConstruction=false
   blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
   lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
   isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010

Comparing the printed logs with Spark 1.2: although the number of getBlockLocations calls is similar, each such call took only 20~30 ms there (versus 2000~3000 ms now), and Spark 1.2 didn't print the detailed LocatedBlocks info.

Another finding is that if I read the Parquet file via Scala code from spark-shell as below, it looks fine, and the computation returns the result as quickly as before.

    sqlContext.parquetFile("data/myparquettable")

Any idea about it? Thank you!


--
郑旭东
Zheng, Xudong

