Re: Spark v2.3.2 : Duplicate entries found for each primary Key

Purushotham Pushpavanthar Sat, 16 Nov 2019 10:16:39 -0800

Thanks Vinoth and Kabeer. It resolved my problem.

Regards,
Purushotham Pushpavanth




On Fri, 15 Nov 2019 at 20:16, Kabeer Ahmed <[email protected]> wrote:

> Adding to Vinoth's response, in spark-shell you just need to copy and
> paste the line below. Let us know if it still doesn't work.
>
> spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
> classOf[org.apache.hadoop.fs.PathFilter]);
>
> On Nov 15 2019, at 1:37 pm, Vinoth Chandar <[email protected]> wrote:
> > Hi,
> >
> > Are you setting the path filters when you query the Hudi Hive table via
> > Spark, as described in
> > http://hudi.apache.org/querying_data.html#spark-ro-view (or, alternatively,
> > http://hudi.apache.org/querying_data.html#spark-rt-view)?
> >
> > - Vinoth
> > On Fri, Nov 15, 2019 at 5:03 AM Purushotham Pushpavanthar <
> > [email protected]> wrote:
> >
> > > Hi,
> > > Below is the CREATE statement for my Hudi dataset.
> > >
> > > CREATE EXTERNAL TABLE `inventory`.`customer`(
> > >   `_hoodie_commit_time` string,
> > >   `_hoodie_commit_seqno` string,
> > >   `_hoodie_record_key` string,
> > >   `_hoodie_partition_path` string,
> > >   `_hoodie_file_name` string,
> > >   `id` bigint,
> > >   `sales` bigint,
> > >   `merchant` bigint,
> > >   `item_status` bigint,
> > >   `tem_shipment` bigint)
> > > PARTITIONED BY (`dt` string)
> > > ROW FORMAT SERDE
> > >   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> > > WITH SERDEPROPERTIES ('serialization.format' = '1')
> > > STORED AS INPUTFORMAT
> > >   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> > > OUTPUTFORMAT
> > >   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> > > LOCATION
> > >   's3://<warehouse-bucket>/<path>/inventory/customer'
> > > TBLPROPERTIES (
> > >   'bucketing_version' = '2',
> > >   'transient_lastDdlTime' = '1572952974',
> > >   'last_commit_time_sync' = '20191114192136')
> > >
> > > I've taken care of adding *hudi-hive-bundle-0.5.1-SNAPSHOT.jar* in Hive,
> > > *hudi-presto-bundle-0.5.1-SNAPSHOT.jar* in Presto and
> > > *hudi-spark-bundle-0.5.1-SNAPSHOT.jar* in Spark (all three share a common
> > > Metastore).
> > > We are running Hudi in COW mode, and we noticed that multiple versions of
> > > the .parquet files are written per partition, depending on the number of
> > > updates arriving in each batch execution. When queried from Hive and
> > > Presto, any primary key with multiple updates returns a single record
> > > with the latest state (I assume *HoodieParquetInputFormat* does the magic
> > > of taking care of duplicates). However, when I execute the same query in
> > > Spark SQL, I get duplicate records for any primary key with multiple
> > > updates.
> > >
> > > Can someone help me understand why Spark is not able to deduplicate
> > > records across multiple commits, which Presto and Hive are able to do?
> > > I've taken care of providing hudi-spark-bundle-0.5.1-SNAPSHOT.jar while
> > > starting spark-shell. Is there something that I'm missing?
> > >
> > > Thanks in advance.
> > > Regards,
> > > Purushotham Pushpavanth
> >
> >
>
>
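
For anyone finding this thread later, here is a rough sketch of why the path
filter removes the duplicates. In COW mode each update rewrites a file group
into a new .parquet file, so a partition holds several versions of the same
records; a reader that lists every file sees each key once per version. The
filter is understood to keep only the newest file of each file group. The
BaseFile class, the file group ids and the paths below are hypothetical and
only model the idea; they are not Hudi's API or its real file naming.

```scala
// Simplified, hypothetical model of a copy-on-write partition.
// commitTime uses the yyyyMMddHHmmss form seen in last_commit_time_sync,
// so plain string comparison orders commits chronologically.
final case class BaseFile(fileGroupId: String, commitTime: String, path: String)

object LatestFileSlice {
  // Keep only the newest base file of every file group.
  def latestPerGroup(files: Seq[BaseFile]): Seq[BaseFile] =
    files
      .groupBy(_.fileGroupId)
      .values
      .map(_.maxBy(_.commitTime))
      .toSeq
      .sortBy(_.fileGroupId)

  // Sample listing: file group fg-1 was rewritten by a later commit.
  val partition: Seq[BaseFile] = Seq(
    BaseFile("fg-1", "20191113101500", "dt=2019-11-14/fg-1_v1.parquet"),
    BaseFile("fg-1", "20191114192136", "dt=2019-11-14/fg-1_v2.parquet"),
    BaseFile("fg-2", "20191113101500", "dt=2019-11-14/fg-2_v1.parquet")
  )

  def main(args: Array[String]): Unit = {
    // Without filtering, all three files are read and keys in fg-1 appear twice.
    println(s"unfiltered files: ${partition.size}")
    // With filtering, only one file per file group remains.
    latestPerGroup(partition).foreach(f => println(f.path))
  }
}
```

In this model the older fg-1 file is dropped and one file per file group is
left, which is the behaviour the setClass line above is meant to switch on for
Spark's file listing.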
