(cc-ing users@, where we should start routing user support from here on)

Sorry, we kind of dropped the ball here. If you set the following config
to "true", the data will be written under the following partition path:

<basePath>/yyyy=2020/mm=07/dd=01 instead of simply <basePath>/2020/07/01

Then when you do spark.read.parquet() with a predicate on yyyy, mm, and dd,
Spark should partition-prune properly. Let me know if you face issues.
Happy to work with you to get this resolved.

/**
  * Flag to indicate whether to use Hive style partitioning.
  * If set true, the names of partition folders follow
  * <partition_column_name>=<partition_value> format.
  * By default false (the names of partition folders are only partition values)
  */
val HIVE_STYLE_PARTITIONING_OPT_KEY = "hoodie.datasource.write.hive_style_partitioning"
val DEFAULT_HIVE_STYLE_PARTITIONING_OPT_VAL = "false"
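
For concreteness, here is a rough end-to-end sketch (the dataframe, the
record key / precombine columns, and the key generator class are
illustrative assumptions on my end; please double check them against your
version):

import org.apache.spark.sql.SaveMode

// Write with hive-style partitioning on, so folders come out as
// yyyy=2020/mm=07/dd=01 instead of 2020/07/01.
inputDF.write.format("org.apache.hudi").
  option("hoodie.table.name", "request_application").
  option("hoodie.datasource.write.recordkey.field", "request_id"). // assumed key column
  option("hoodie.datasource.write.precombine.field", "ts"). // assumed ordering column
  option("hoodie.datasource.write.partitionpath.field", "yyyy,mm,dd").
  // multi-field partition paths need the complex key generator
  // (class name may differ across versions)
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.ComplexKeyGenerator").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  mode(SaveMode.Append).
  save("/projects/cdp/data/base/request_application")

// Since folder names now carry column=value, plain parquet reads discover
// yyyy/mm/dd as partition columns and prune to matching directories only.
spark.read.parquet("/projects/cdp/data/base/request_application").
  where("yyyy = 2020 and mm = 7 and dd = 1").
  explain() // look for PartitionFilters in the plan to confirm pruning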



On Wed, Jul 8, 2020 at 7:26 AM vbal...@apache.org <vbal...@apache.org>
wrote:

> I don't remember the root cause completely, Vinoth. I guess it was due to
> some protocol mismatch.
> Balaji.V
>
> On Tuesday, July 7, 2020, 10:25:48 PM PDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hi,
>
> Yes, it can be an issue; probably good to get the table written using hive
> style partitioning. I will check on this more and get back to you.
>
> Balaji, do you know off the top of your head?
>
> Thanks
> Vinoth
>
> On Sat, Jul 4, 2020 at 11:22 PM selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
> > To add some more info: my join condition would scan a 180-day range of
> > folders.
> >
> > On Sat, Jul 4, 2020 at 11:13 PM selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> > > Team,
> > >
> > > I have a question on keeping Hive in sync. A shared Hadoop environment
> > > restricts me from using Hudi 0.5.1 or higher, so I ended up using
> > > 0.5.0. My Hadoop cluster currently runs Hive 1.2.x, which does not
> > > support Hudi's Hive sync.
> > >
> > > So, I am not using the Hive sync feature. I am reading the table as below.
> > >
> > >
> > > sparkSession.
> > >   read.
> > >   format("org.apache.hudi").
> > >   load("/projects/cdp/data/base/request_application/*/*").
> > >   createOrReplaceTempView("base_request_application")
> > >
> > >
> > > I am going to store 3 years' worth of data partitioned by day/hour.
> > > When I load 3 years of data, I would have (3*365*24) = 26280
> > > directories. Using the above approach and reading every time, I see
> > > that all the directory names are indexed. Would it impact performance
> > > when joining with another table, if I don't use Hive-style partition
> > > pruning?
> > >
> > > Thanks,
> > > Selva
> > >
> > >
> >
>
