Thanks for coming back with the solution!
Sorry my suggestion did not help.
Daniel
On Wed, 20 Jun 2018, 21:46, mattl156 wrote:
Alright so I figured it out.
When reading from and writing to Hive metastore Parquet tables, Spark SQL
will try to use its own Parquet support instead of Hive SerDe for better
performance.
So setting options like the ones below has no impact.
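For reference, that conversion is controlled by the `spark.sql.hive.convertMetastoreParquet` setting; disabling it makes Spark fall back to the Hive SerDe path, where Hive-side input settings do apply. A minimal sketch, assuming a Spark SQL session:

```sql
-- With conversion disabled, Spark reads/writes Hive metastore Parquet tables
-- through the Hive SerDe instead of its native Parquet support.
SET spark.sql.hive.convertMetastoreParquet=false;
```

Note this trades away the performance benefit of the native Parquet reader.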
Thanks. Unfortunately I don't have control over how the data is inserted, and the
table is not partitioned.
The subdirectories are created because when Tez executes an INSERT into a table
from a UNION query, it creates subdirectories so that it can write in parallel.
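As an illustration (table and source names are hypothetical), an INSERT backed by a UNION on Tez typically writes each branch of the union through its own set of tasks, producing one subdirectory per branch:

```sql
-- Hypothetical example: each UNION branch is written by its own tasks,
-- so Tez places output in per-branch subdirectories under the table location.
INSERT INTO TABLE table1
SELECT * FROM src_a
UNION ALL
SELECT * FROM src_b;
-- Resulting layout (exact directory names vary by Hive version), e.g.:
--   s3://table1/1/...
--   s3://table1/2/...
```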
I've also
Hi Matt,
What I tend to do is partition by date in the following way:
s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json
Note the pattern: key=value for physical partitions.
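The key=value convention can be sketched with a small helper (plain Python; the function name is illustrative, not a Spark API):

```python
def partition_path(base, **parts):
    """Build a Hive-style partition path: base/key1=val1/key2=val2/..."""
    return base.rstrip("/") + "/" + "/".join(f"{k}={v}" for k, v in parts.items())

# Reconstructs the layout shown above (file name omitted):
path = partition_path(
    "s3://data-lake/pipeline1/",
    extract_year=2018, extract_month="06", extract_day=20,
)
# path == "s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20"
```

In practice you rarely build these paths by hand: Spark writes this layout itself via `DataFrame.write.partitionBy(...)`, and its reader discovers the key=value directories as partition columns.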
When you read that like this:
spark.read.json("s3://data-lake/pipeline1/")
It will
Hello,
We have a number of Hive tables (non-partitioned) that are populated with
subdirectories (the result of Tez execution-engine UNION queries).
E.g. table location “s3://table1/”, with the actual data residing in:
s3://table1/1/data1
s3://table1/2/data2
s3://table1/3/data3
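A commonly cited workaround for layouts like this (a sketch; whether it takes effect depends on your Hive/Spark versions and on which read path is used) is to enable recursive directory listing on the Hive side:

```sql
-- Hypothetical session settings: allow input formats to descend into
-- subdirectories under the table location when listing input files.
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
```

On Spark's own file-source path, Spark 3.0+ also offers a `recursiveFileLookup` read option, though it disables partition discovery.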
When