---------- Forwarded message ---------
From: Hao Ren <inv...@gmail.com>
Date: Thu, Aug 8, 2019 at 4:15 PM
Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive table
To: Gourav Sengupta <gourav.sengu...@gmail.com>

Hi Gourav,

I am using enableHiveSupport. The table was not created by Spark; it already exists in Hive. All I did was read it with a SQL query in Spark.

FYI, I put hive-site.xml in the spark/conf/ directory to make sure that Spark can access Hive.

Hao

On Thu, Aug 8, 2019 at 1:24 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> Just out of curiosity did you start the SPARK session using
> enableHiveSupport() ?
>
> Or are you creating the table using SPARK?
>
>
> Regards,
> Gourav
>
> On Wed, Aug 7, 2019 at 3:28 PM Hao Ren <inv...@gmail.com> wrote:
>
>> Hi,
>> I am using Spark SQL 2.3.3 to read a hive table which is partitioned by
>> day, hour, platform, request_status and is_sampled. The underlying data is
>> in parquet format on HDFS.
>> Here is the SQL query to read just *one partition*.
>>
>> ```
>> spark.sql("""
>> SELECT rtb_platform_id, SUM(e_cpm)
>> FROM raw_logs.fact_request
>> WHERE day = '2019-08-01'
>> AND hour = '00'
>> AND platform = 'US'
>> AND request_status = '3'
>> AND is_sampled = 1
>> GROUP BY rtb_platform_id
>> """).show
>> ```
>>
>> However, from the Spark web UI, the stage description shows:
>>
>> ```
>> Listing leaf files and directories for 201616 paths:
>> viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0,
>> ...
>> ```
>>
>> It seems the job is reading all of the partitions of the table and the
>> job takes too long for just one partition. One workaround is using
>> `spark.read.parquet` API to read parquet files directly. Spark has
>> partition-awareness for partitioned directories.
>>
>> But still, I would like to know if there is a way to leverage
>> partition-awareness via Hive by using `spark.sql` API?
>>
>> Any help is highly appreciated!
>>
>> Thank you.
>>
>> --
>> Hao Ren
>>
>

--
Hao Ren
Software Engineer in Machine Learning @ Criteo
Paris, France
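
For reference, the `spark.read.parquet` workaround mentioned in the original message might look roughly like the sketch below. It is only an illustration under stated assumptions: the table root is taken from the path shown in the Spark UI listing, the directory layout is assumed to follow the usual Hive `key=value` convention, and the object name `ReadOnePartition` and the helper `partitionPath` are hypothetical names introduced here. The `basePath` option asks Spark to keep the partition columns (`day`, `hour`, and so on) in the resulting DataFrame even though only one leaf directory is read.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ReadOnePartition {
  // Assumption: table root as it appears in the Spark UI listing quoted above.
  val basePath = "viewfs://root/user/bilogs/logs/fact_request"

  // Hypothetical helper: builds a Hive-style partition directory,
  // i.e. base/k1=v1/k2=v2/...
  def partitionPath(base: String, spec: Seq[(String, String)]): String =
    (base +: spec.map { case (k, v) => s"$k=$v" }).mkString("/")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-one-partition")
      .getOrCreate()

    // Same partition values as in the WHERE clause of the SQL query.
    val path = partitionPath(basePath, Seq(
      "day" -> "2019-08-01",
      "hour" -> "00",
      "platform" -> "US",
      "request_status" -> "3",
      "is_sampled" -> "1"
    ))

    // Only the files under this one directory are listed and read.
    val df = spark.read.option("basePath", basePath).parquet(path)

    df.groupBy("rtb_platform_id").agg(sum("e_cpm")).show()

    spark.stop()
  }
}
```

Because the partition directory is passed explicitly, Spark has no reason to list the other leaf directories of the table. The trade-off is that the query no longer goes through the Hive metastore, so the path layout has to be kept in sync by hand.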