Hi Jun,

Responses below.

>>1. Some inflight files may never reach commit?
Yes. The next write attempt will first issue a rollback that cleans up any
such partial/leftover files before it begins the new commit.
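
To make that concrete, here is a rough, untested sketch of how you could spot
such instants yourself. The timeline naming below (<instant>.commit for a
completed commit, <instant>.inflight for an inflight one, under .hoodie/) is
from memory, so treat it as an assumption:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val basePath = new Path("/path/to/hudi/table")  // hypothetical table location
    val fs = basePath.getFileSystem(new Configuration())

    // List the timeline files under .hoodie
    val timeline  = fs.listStatus(new Path(basePath, ".hoodie")).map(_.getPath.getName)
    val completed = timeline.filter(_.endsWith(".commit")).map(_.stripSuffix(".commit")).toSet
    val inflight  = timeline.filter(_.endsWith(".inflight")).map(_.stripSuffix(".inflight")).toSet

    // Instants that started writing but never committed; the rollback issued
    // by the next write cleans up any parquet files these left behind.
    (inflight -- completed).foreach(println)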

>>2. In the case where an inflight commit and the parquet files it generated
still exist, will the global dedup skip such files?
Even if they are not rolled back, we check the inflight parquet files against
the committed timeline, which they won't be a part of. So it should be safe.
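
The check is essentially: take the commit time embedded in the parquet file
name and see whether it belongs to a completed instant. A small sketch,
reusing the completed set from above (the <fileId>_<writeToken>_<commitTime>.parquet
naming is again an assumption from memory):

    // completed: the set of committed instant times built from the timeline above
    def isCommitted(fileName: String, completed: Set[String]): Boolean = {
      val commitTime = fileName.stripSuffix(".parquet").split("_").last
      completed.contains(commitTime)
    }

    // Files written by an inflight commit fail this check, so neither the
    // global dedup nor an RO-view query will pick them up.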


>>3. In the case where an inflight commit and the parquet files it generated
still exist, will the correct query result be decided by the read config (I
mean mapreduce.input.pathFilter.class in Spark SQL)?
Yes, the filtering should work as well. It's the same technique the writer
uses.
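
Concretely, in a spark-shell / Spark SQL session that would look like the
following (the table path and partition glob are hypothetical; adjust to your
layout):

    spark.sparkContext.hadoopConfiguration.setClass(
      "mapreduce.input.pathFilter.class",
      classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter])

    // Only parquet files belonging to completed commits survive the filter.
    spark.read.parquet("/path/to/hudi/table/*/*/*").count()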


>>4. Is there any way we can use
>>
>>spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
>>  classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
>>  classOf[org.apache.hadoop.fs.PathFilter]);
>>
>>in the Spark Thrift Server when starting it?

I am not familiar with the Spark Thrift Server myself. Any pointers to where
I can learn more?
Two suggestions:
- Check whether you can add this to the Hadoop configuration XML files and
see if it gets picked up by Spark.
- Alternatively, set the Spark config mentioned at
http://hudi.apache.org/querying_data.html#spark-rt-view (it works for the RO
view as well), which I am assuming should be doable with the Thrift Server;
see the sketch below.
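
One more thing that may be worth trying (unverified on my end): Spark copies
any spark.hadoop.* property into the Hadoop Configuration it hands to input
formats, so passing the filter class at startup might have the same effect as
the setClass call above:

    ./sbin/start-thriftserver.sh \
      --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter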


Thanks
Vinoth


On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]> wrote:

> Hi,
> Link: https://github.com/apache/incubator-hudi/issues/639
> Sorry, I failed to open
> https://lists.apache.org/[email protected].
> I have some follow-up questions for issue 639:
>
> > So, the sequence of events is: we write parquet files, and upon successful
> > writing of all attempted parquet files, we actually mark the commit as
> > completed (i.e., not inflight anymore). So this is normal. This is done to
> > prevent queries from reading partially written parquet files.
> >
>
> Does that mean:
> 1. Some inflight files may never reach commit?
> 2. In the case where an inflight commit and the parquet files it generated
> still exist, will the global dedup skip such files?
> 3. In the case where an inflight commit and the parquet files it generated
> still exist, will the correct query result be decided by the read config (I
> mean mapreduce.input.pathFilter.class in Spark SQL)?
> 4. Is there any way we can use
>
> spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
>   classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
>   classOf[org.apache.hadoop.fs.PathFilter]);
>
> in the Spark Thrift Server when starting it?
>
> Best,
> --
> *Jun Zhu*
> Sr. Engineer I, Data
> +86 18565739171
> Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
>
