Hi Matt,

What I tend to do is partition by date in the following way:
s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json

The pattern is key=value for physical partitions. When you read it like this:

    spark.read.json("s3://data-lake/pipeline1/")

it will bring back the data with the schema inferred from the JSON, plus 3 extra fields: extract_year, extract_month and extract_day (see the sketch after the quoted message below).

I was looking for the documentation that describes this way of partitioning but could not find it; I can reply with the link once I do.

Hope that helps,

---
Daniel Mateus Pires
Data Engineer
Hudson's Bay Company

On Wed, Jun 20, 2018 at 5:39 AM, mattl156 <matt.l...@gmail.com> wrote:
> Hello,
>
> We have a number of Hive tables (non-partitioned) that are populated with
> subdirectories (the result of Tez execution engine union queries).
>
> E.g. table location: "s3://table1/", with the actual data residing in:
>
> s3://table1/1/data1
> s3://table1/2/data2
> s3://table1/3/data3
>
> When using SparkSession (sql/hiveContext has the same behavior) and
> spark.sql to query the data, no records are returned because of these
> subdirectories, e.g.:
>
> val df = spark.sql("select * from db.table1").show()
>
> I've tried a number of setConf properties, e.g.
> spark.hive.mapred.supports.subdirectories=true and
> mapreduce.input.fileinputformat.input.dir.recursive=true, but it does not
> look like any of these properties are supported.
>
> Has anyone run into similar problems or found ways to resolve this? Our
> current alternative is reading the input path directly, e.g.:
>
> spark.read.csv("s3://bucket-name/table1/bullseye_segments/*/*")
>
> But this requires prior knowledge of the path or an extra step to
> determine it.
>
> Thanks,
>
> Matt
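Below is a minimal, self-contained sketch of the key=value layout described in the reply, assuming a local SparkSession; the local path /tmp/data-lake/pipeline1 stands in for s3://data-lake/pipeline1/, and the object name and sample columns are illustrative rather than taken from the thread:

    import org.apache.spark.sql.SparkSession

    object PartitionDiscoverySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-discovery-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // extract_year / extract_month / extract_day become physical partition directories
        val events = Seq(
          ("a", 1, 2018, 6, 20),
          ("b", 2, 2018, 6, 21)
        ).toDF("id", "value", "extract_year", "extract_month", "extract_day")

        // partitionBy writes .../extract_year=2018/extract_month=6/extract_day=20/part-*.json
        val base = "/tmp/data-lake/pipeline1" // stand-in for s3://data-lake/pipeline1/
        events.write
          .mode("overwrite")
          .partitionBy("extract_year", "extract_month", "extract_day")
          .json(base)

        // Reading the root path infers the JSON schema and adds the three
        // partition columns back from the directory names
        val df = spark.read.json(base)
        df.printSchema()
        df.show()

        spark.stop()
      }
    }

Reading the root path this way relies on Spark's built-in partition discovery for file-based sources, which is why the key=value directory names matter; it is a separate mechanism from the Hive-side recursive-subdirectory settings mentioned in the original question.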