Hi Matt,

What I tend to do is partition by date in the following way:

s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json


Note that the key=value pattern is what Spark recognizes as physical partitions.
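
This is the same layout Spark itself produces when writing partitioned data.
A minimal sketch (assuming a DataFrame df that already has extract_year,
extract_month and extract_day columns; the names here just mirror the example
above):

  // partitionBy moves the three columns out of the JSON files and into
  // the directory names, one directory per distinct value combination
  df.write
    .partitionBy("extract_year", "extract_month", "extract_day")
    .json("s3://data-lake/pipeline1/")

(If the columns are zero-padded strings you get exactly the extract_month=06
form shown above; integer columns would give extract_month=6.)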

When you read it back like this:

spark.read.json("s3://data-lake/pipeline1/")

Spark returns the data with the schema inferred from the JSON plus three extra
fields: extract_year, extract_month, and extract_day.
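
A quick sketch of the read side (untested, but this is Spark's standard
partition discovery behavior):

  val df = spark.read.json("s3://data-lake/pipeline1/")
  df.printSchema() // JSON fields plus extract_year, extract_month, extract_day

  // Filters on the partition columns prune whole directories instead of
  // scanning every file:
  df.filter("extract_year = 2018 AND extract_month = 6").show()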

I was looking for the documentation that describes this way of partitioning
but could not find it; I can reply with the link once I do.

Hope that helps,
---
Daniel Mateus Pires
Data Engineer
Hudson's Bay Company

On Wed, Jun 20, 2018 at 5:39 AM, mattl156 <matt.l...@gmail.com> wrote:

> Hello,
>
>
>
> We have a number of Hive tables (non-partitioned) that are populated with
> subdirectories (the result of Tez execution engine UNION queries).
>
>
>
> E.g. table location: “s3://table1/”, with the actual data residing in:
>
>
>
> s3://table1/1/data1
>
> s3://table1/2/data2
>
> s3://table1/3/data3
>
>
>
> When using SparkSession (sql/hiveContext has the same behavior) and
> spark.sql to query the data, no records are displayed due to these
> subdirectories.
>
>
>
> e.g.
>
> spark.sql("select * from db.table1").show()
>
>
>
> I’ve tried a number of setConf properties, e.g.
> spark.hive.mapred.supports.subdirectories=true and
> mapreduce.input.fileinputformat.input.dir.recursive=true, but it does not
> look like any of these properties are supported.
>
>
>
> Has anyone run into similar problems or found a way to resolve this? Our
> current alternative is reading the input path directly, e.g.:
>
>
>
>
> spark.read.csv("s3://bucket-name/table1/bullseye_segments/*/*")
>
>
> But this requires prior knowledge of the path or an extra step to determine
> it.
>
>
> Thanks,
>
> Matt
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
