Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Thanks for coming back with the solution! Sorry my suggestion did not help. Daniel. On Wed, 20 Jun 2018, 21:46 mattl156 wrote: > Alright, so I figured it out. > > When reading from and writing to Hive metastore Parquet tables, Spark SQL > will try to use its own Parquet support instead of Hive SerDe for better performance …

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Alright, so I figured it out. When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe, for better performance. And so settings like the ones below have no impact.
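For context, a minimal sketch of the workaround this thread converges on, assuming Spark 2.x with Hive support; the table name table1 is illustrative, and the exact settings shown are an assumption about which ones the poster means. Disabling spark.sql.hive.convertMetastoreParquet makes Spark fall back to the Hive SerDe path, which is the path that consults the recursive-directory settings at all:

    import org.apache.spark.sql.SparkSession

    // Sketch, not the poster's exact code: force the Hive SerDe read path so
    // the recursive-directory settings below are actually consulted.
    val spark = SparkSession.builder()
      .appName("read-hive-subdirs")  // illustrative app name
      .enableHiveSupport()
      .config("spark.sql.hive.convertMetastoreParquet", "false")
      .getOrCreate()

    // Honored only on the Hive SerDe path, not by Spark's native Parquet reader.
    spark.sql("SET mapred.input.dir.recursive=true")
    spark.sql("SET hive.mapred.supports.subdirectories=true")

    spark.sql("SELECT COUNT(*) FROM table1").show()  // table1 is hypothetical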

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Thanks. Unfortunately I don't have control over how the data is inserted, and the table is not partitioned. The sub-directories are being created because when Tez executes an INSERT into a table from a UNION query, it creates sub-directories so that it can write in parallel. I've also …

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Hi Matt, What I tend to do is partition by date in the following way: s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json Note the key=value pattern for the physical partitions. When you read that like this: spark.read.json("s3://data-lake/pipeline1/") it will …
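As an illustration of the suggestion above (a sketch using the bucket and column names from Daniel's example, nothing more): with key=value directory names, Spark's partition discovery surfaces extract_year, extract_month, and extract_day as columns of the DataFrame, and filters on them prune to the matching directories instead of scanning the whole prefix:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Partition discovery: key=value directories become DataFrame columns.
    val df = spark.read.json("s3://data-lake/pipeline1/")

    // Reads only the matching sub-directories (partition pruning),
    // not every file under the prefix.
    df.filter($"extract_year" === 2018 && $"extract_month" === 6).show()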

[Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-19 Thread mattl156
Hello, We have a number of Hive tables (non-partitioned) that are populated with sub-directories (the result of Tez execution engine UNION queries). E.g. table location: "s3://table1/", with the actual data residing in: s3://table1/1/data1, s3://table1/2/data2, s3://table1/3/data3. When …
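To make the symptom concrete (a hypothetical reproduction, assuming Spark 2.x defaults and a Parquet-backed table named table1 as in the layout above): with Spark's native Parquet reader, only files directly under the table location are picked up, so a query over the table can come back empty even though the sub-directories contain data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Data lives in s3://table1/1/, s3://table1/2/, ... but the native
    // Parquet reader lists only files directly under the table location,
    // so nothing is found.
    spark.sql("SELECT * FROM table1 LIMIT 10").show()  // may return zero rows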