Xudong and Rex,
Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339,
after we get a Hive Parquet table from the metastore and convert it to our
native Parquet code path, we will cache the converted relation. For now, the
first access to that Hive Parquet table reads all of the footers.
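The caching pattern is roughly the sketch below; the class and method names
here are only illustrative, not the actual Spark internals:

import scala.collection.mutable

// Hypothetical sketch: Relation stands in for the converted Parquet relation type.
class ConvertedRelationCache[Relation] {
  private val cache = mutable.HashMap.empty[String, Relation]

  // convert is evaluated only on the first access for a given table;
  // that first conversion is where the expensive footer reads happen.
  def getOrConvert(table: String)(convert: => Relation): Relation =
    synchronized { cache.getOrElseUpdate(table, convert) }

  // Dropping the entry (e.g. on refresh) makes the next access re-read the metadata.
  def invalidate(table: String): Unit =
    synchronized { cache.remove(table) }
}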
Yin,
Thanks for your reply.
We already patched this PR into our 1.3.0 build.
As Xudong mentioned, we have thousands of Parquet files, so the first read is
very slow, and another app will add more files and refresh the table
regularly.
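A minimal sketch of that add-files-then-refresh flow, assuming the refresh
goes through HiveContext.refreshTable on Spark 1.3 (the table name "events"
is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("refresh-example"))
val hiveContext = new HiveContext(sc)
// After the other job appends new Parquet part-files to the table directory,
// refreshTable invalidates the cached relation so the new files become
// visible; the next query then pays the footer-reading cost again.
hiveContext.refreshTable("events")
hiveContext.sql("SELECT COUNT(*) FROM events").collect()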
Cheng Lian's PR 5334 seems able to resolve this issue; it will skip reading all
We have a similar issue with massive Parquet files. Cheng Lian, could you
have a look?
2015-04-08 15:47 GMT+08:00 Zheng, Xudong dong...@gmail.com:
Hi Cheng,
I tried both of these patches, but it seems they still do not resolve my
issue. I found that most of the time is spent on this line in newParquet.scala:
ParquetFileReader.readAllFootersInParallel(
  sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData)
which needs to read all the files.
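To make the cost concrete, here is a standalone sketch of roughly what that
call does on the driver (parquet-mr 1.6-era API; the path /warehouse/events
and the flat file listing are placeholders for however the table's leaf files
are discovered):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import parquet.hadoop.ParquetFileReader
import scala.collection.JavaConversions.seqAsJavaList

val conf = new Configuration()
val fs = FileSystem.get(conf)
// One FileStatus per Parquet part-file; with thousands of files this listing
// plus the footer reads below is what keeps the driver busy on the first query.
val leaves = fs.listStatus(new Path("/warehouse/events"))
  .filter(_.getPath.getName.endsWith(".parquet")).toSeq
// The last argument (task-side metadata / skip row groups) reduces how much of
// each footer is parsed, but every footer is still opened once.
val footers = ParquetFileReader.readAllFootersInParallel(conf, seqAsJavaList(leaves), true)
println(s"read ${footers.size()} footers")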
Hey Xudong,
We have been digging into this issue for a while, and we believe PR 5339
http://github.com/apache/spark/pull/5339 and PR 5334
http://github.com/apache/spark/pull/5334 should fix this issue.
There are two problems:
1. Normally we cache Parquet table metadata for better performance, but
when
Hi all,
We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we
find that even a simple COUNT(*) query is much slower (about 100x) than on
Spark 1.2.
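The query itself is nothing special; it is essentially the following (the
table name parquet_events is a placeholder):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc is the existing SparkContext
// On Spark 1.2 this was fast; on 1.3 most of the wall-clock time goes to the
// driver fetching HDFS block and footer metadata before any tasks run.
hiveContext.sql("SELECT COUNT(*) FROM parquet_events").show()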
I find that most of the time is spent on the driver getting HDFS blocks, and
a large number of log lines like the one below are printed:
15/03/30 23:03:43 DEBUG